Grabbing file descriptors with pidfd_getfd()

By Jonathan Corbet
January 9, 2020

In response to a growing desire for ways to control groups of processes from user space, the kernel has added a number of mechanisms that allow one process to operate on another. One piece that is currently missing, though, is the ability for a process to snatch a copy of an open file descriptor from another. That gap may soon be filled if the pidfd_getfd() system-call patch set from Sargun Dhillon is merged.

One thing that is possible in current kernels is to open a file that another process also has open; the information needed to do that is in each process's /proc directory. That does not work, though, for file descriptors referring to pipes, sockets, or other objects that do not appear in the filesystem hierarchy. Just as importantly, opening a file in this way creates a new entry in the file table, with its own offset and status flags; it is not the entry corresponding to the file descriptor in the process of interest.
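The file-table distinction can be observed from user space. The following is a minimal sketch, not code from the patch set; the helper names reopen_via_proc() and shares_offset() are invented for illustration. It reopens a descriptor through /proc and checks whether the result shares a file offset with the original, which is what sharing a file-table entry implies.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reopen descriptor `fd` of process `pid` through /proc/<pid>/fd/<fd>.
 * Returns a new descriptor, or -1 with errno set.
 * (Hypothetical helper name, for illustration only.)
 */
static int reopen_via_proc(pid_t pid, int fd, int flags)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%ld/fd/%d", (long)pid, fd);
    return open(path, flags);
}

/*
 * Returns 1 if descriptors `a` and `b` share a file offset (and thus the
 * same file-table entry), 0 if they do not, -1 if lseek() fails.
 * Both descriptors must refer to a seekable file.
 */
static int shares_offset(int a, int b)
{
    /* Park b at offset 0, then move a; see whether b moved too. */
    if (lseek(b, 0, SEEK_SET) < 0 || lseek(a, 7, SEEK_SET) < 0)
        return -1;
    return lseek(b, 0, SEEK_CUR) == 7;
}
```

For a regular file, a descriptor obtained with dup() should report a shared offset, while one obtained with reopen_via_proc() should not.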

That distinction matters if the objective is to modify that particular file descriptor. One use case mentioned in the patch series is using seccomp to intercept attempts to bind a socket to a privileged port. A privileged supervisor process could, if it so chose, grab the file descriptor for that socket from the target process and actually perform the bind — something the target process would not have the privilege to do on its own. Since the grabbed file descriptor is essentially identical to the original, the bind operation will be visible to the target process as well.
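The visibility of the bind can be sketched without the new system call. In the example below, dup() stands in for pidfd_getfd() (an assumption made so that the code runs on any kernel), and the helper names are invented. Binding through the copy changes the socket itself, so the original descriptor sees the result.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind `sock` to the wildcard address with a kernel-chosen port. */
static int bind_any_port(int sock)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;              /* let the kernel pick a port */
    return bind(sock, (struct sockaddr *)&addr, sizeof(addr));
}

/* Local port of `sock` in host byte order; 0 if unbound, -1 on error. */
static int local_port(int sock)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (getsockname(sock, (struct sockaddr *)&addr, &len) < 0)
        return -1;
    return ntohs(addr.sin_port);
}

/*
 * Bind through a copy of `original` and report the port as seen through
 * the original descriptor. dup() is only a stand-in here for a descriptor
 * grabbed from another process.
 */
static int bind_through_copy(int original)
{
    int copy = dup(original);
    int rc;

    if (copy < 0)
        return -1;
    rc = bind_any_port(copy);
    close(copy);
    return rc < 0 ? -1 : local_port(original);
}
```

A supervisor would bind to the specific privileged port the target asked for rather than to port zero; the kernel-chosen port just keeps the sketch runnable without privileges.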

For the sufficiently determined, it is actually possible to extract a file descriptor from another process now. The technique involves using ptrace() to attach to that process, stop it from executing, inject some code that opens a connection to the supervisor process and sends the file descriptor via an SCM_RIGHTS datagram, then running that code. This solution might justly be said to be slightly lacking in elegance. It also requires stopping the target process, which is likely to be unwelcome.
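For reference, the SCM_RIGHTS half of that technique looks roughly like the sketch below. It covers only the descriptor passing over a Unix-domain socket, not the ptrace() code injection, and the function names send_fd() and recv_fd() are invented for illustration.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor `fd` over Unix-domain socket `sock`; 0 on success. */
static int send_fd(int sock, int fd)
{
    char byte = 'F';                /* at least one byte of real data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;       /* forces correct alignment */
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`; returns it, or -1 on failure. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor gets a new number in the receiver but refers to the same open file as the sender's. The catch is that send_fd() must run inside the target process, which is why code injection is needed when the target is not cooperating.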

This functionality, without the need to stop the target process, is relatively easy to implement in the kernel, though; a supervisor process would merely need to make a call to:

    int pidfd_getfd(int pidfd, int targetfd, unsigned int flags);

The target process is specified by pidfd (which is, as one might expect, a pidfd, presumably obtained when the process was created). The file descriptor to grab is given by targetfd; if all goes well, the return value will be a local file-descriptor number corresponding to the target process's file. For all to go well, the calling process must have the ability to call ptrace() on the target process.
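A caller might wrap the new call as in the sketch below. It assumes the call is merged with the signature shown above and that the C library offers no wrapper, so it goes through syscall(). The fallback system-call numbers are those used on most architectures by kernels that provide these calls and should be treated as assumptions; the wrapper names sys_pidfd_open(), sys_pidfd_getfd(), and grab_fd() are invented here. The pidfd is obtained with pidfd_open() rather than at process creation.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback numbers (assumed; valid on most architectures) for old headers. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/* Thin wrappers; named sys_* to avoid clashing with any libc versions. */
static int sys_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

static int sys_pidfd_getfd(int pidfd, int targetfd, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_getfd, pidfd, targetfd, flags);
}

/*
 * Grab descriptor `targetfd` from process `pid`.
 * Returns a local descriptor, or -1 with errno set (ENOSYS if the kernel
 * lacks the call, EPERM if the ptrace() access check fails, and so on).
 */
static int grab_fd(pid_t pid, int targetfd)
{
    int pidfd = sys_pidfd_open(pid, 0);
    int fd, saved;

    if (pidfd < 0)
        return -1;
    fd = sys_pidfd_getfd(pidfd, targetfd, 0);   /* flags must be zero */
    saved = errno;
    close(pidfd);
    errno = saved;                  /* report the getfd error, not close's */
    return fd;
}
```

A supervisor that created the target itself would normally already hold a pidfd and could skip the pidfd_open() step, which also avoids any window in which the PID could be reused.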

The flags argument is currently unused and must be zero. There are, evidently, plans to add flags in the future, though. One would cause the file descriptor to be closed in the target process after being copied to the caller, thus truly "stealing" the descriptor from the target. Another would remove any related control-group data from socket file descriptors during the copy operation.

This patch set has been through an impressive number of versions — and a fair amount of evolution — since it was first posted on December 5. The initial version added a new PTRACE_GETFD command to ptrace(). Version 3 switched to an ioctl() operation on a pidfd instead. In version 5, fifteen days after the initial posting, this functionality moved into a separate system call. The current posting is version 9.

From the beginning there has not been much concern about the goals behind this feature; the comments have mostly focused on the implementation. At this point, Dhillon would appear to have just about exhausted the set of possible implementations — though some might be justified in thinking that a BPF version in the near future is inevitable. Failing that, this new system call may well be on track for the 5.6 or 5.7 merge window.

Posted Jan 9, 2020 18:44 UTC (Thu) by NYKevin (subscriber, #129325)

> That distinction matters if the objective is to modify that particular file descriptor. One use case mentioned in the patch series is using seccomp to intercept attempts to bind a socket to a privileged port. A privileged supervisor process could, if it so chose, grab the file descriptor for that socket from the target process and actually perform the bind — something the target process would not have the privilege to do on its own. Since the grabbed file descriptor is essentially identical to the original, the bind operation will be visible to the target process as well.
>
> For the sufficiently determined, it is actually possible to extract a file descriptor from another process now. The technique involves using ptrace() to attach to that process, stop it from executing, inject some code that opens a connection to the supervisor process and sends the file descriptor via an SCM_RIGHTS datagram, then running that code. This solution might justly be said to be slightly lacking in elegance. It also requires stopping the target process, which is likely to be unwelcome.

On first read, I found this rather confusing. Surely the sandboxed process would be able to open that AF_UNIX connection itself, right?

But no, because they're not talking about a sandboxed process that is cooperating with the supervisor. They're (I think) talking about a sandboxed process that is ignorant of its sandbox and thinks it can "just call bind(2)." In that case, you actually need to intercept that call and emulate it outside the sandbox, without the sandboxed process noticing.

What bothers me most, however, is that this still feels like an antiquated system design. In the great before-times, inetd would spawn your server with the socket already hooked to stdin, and you wouldn't need to think about calling bind() or indeed any part of the sockets interface. While there are obvious scalability concerns with that approach, I still believe that binding sockets (to well known ports) ought to be something that is handled by system infrastructure and not separately by each individual server.

Posted Jan 9, 2020 19:04 UTC (Thu) by Karellen (subscriber, #67644)

> I still believe that binding sockets (to well known ports) ought to be something that is handled by system infrastructure and not separately by each individual server.

So, like sd_listen_fds()?

Posted Jan 10, 2020 13:39 UTC (Fri) by miquels (guest, #59247)

Or things like authbind and innbind?

Posted Jan 10, 2020 14:29 UTC (Fri) by Karellen (subscriber, #67644)

Thanks for pointing to those!

However, I'd have reservations about using authbind: LD_PRELOAD is handy for debugging and trying out weird tricks, but I'm wary of using it in production systems.

innbind looks much cleaner, and certainly would allow you to write a program that could bind to privileged ports without needing to run as root, but as far as I can tell it allows any program on the system to bind privileged ports. If you installed it so that only members of a specific group were able to run it, and limited which programs ran as members of that group, that could work.

Posted Jan 10, 2020 15:14 UTC (Fri) by nix (subscriber, #2304)

innbind is usually installed mode 4550 (setuid root), group news, so it's only executable by things in the Usenet news subsystem, which are all in the same trust domain.

Posted Jan 11, 2020 23:12 UTC (Sat) by rra (subscriber, #99804)

To ask what's probably the same question in a slightly different way: is the rule that only root can bind to ports below 1024 still useful?

Back when that was added to UNIX's security model, there was a wealth of programs that used the ability to bind to specific ports as an authorization control of various kinds (remember identd?). Most of those protocols are thoroughly obsolete (I hope no one is using traditional rlogin with rhosts authentication these days), so protecting those ports doesn't serve the same purpose.

I would argue that, today, the security concern is preventing programs from grabbing ports they're not "supposed" to have, but that problem is not limited to ports under 1024 except by history and convention. There are a lot of services that listen on ports above 1024 where some race condition allowing a user process to bind to that port is equally problematic.

It feels like a more useful security primitive now would be controlling the specific ports to which a process can bind, which looks more like socket activation (as you describe), or like a container where the process can bind to any port it wants but only expected ports are routed outside the container, so binding to other ports is futile.

Posted Jan 13, 2020 9:24 UTC (Mon) by cortana (subscriber, #24596)

It may be interesting to note another alternative: a Mandatory Access Control system such as SELinux, where confined processes are only allowed to bind to ports permitted by the policy (e.g., Apache running in the httpd_t domain can only listen on ports labelled with http_port_t).

Posted Jan 9, 2020 18:57 UTC (Thu) by zblaxell (subscriber, #26385)

> One would cause the file descriptor to be closed in the target process after being copied to the caller, thus truly "stealing" the descriptor from the target.

That sounds messy--the FD could end up being used again by an open in
some other thread of the target process, causing hilarious confusion on
the target side if the target is not expecting FD thievery.

Why not do an atomic FD swap?

    /* hypothetical call; its return value is the stolen_fd discussed below */
    int pidfd_swapfd(int pid_fd, int target_fd, int flags, int caller_fd);

Set caller_fd = NOFD if you really want the FD closed in the target process;
otherwise, the caller's caller_fd becomes the target's target_fd, while the
former target's target_fd is returned in stolen_fd.

Set target_fd = NOFD to copy caller_fd to the target process, assigning
a new FD as if the target process had performed an open(). The new FD
number in the target is returned in stolen_fd.

caller_fd isn't closed in the calling process--close() is fine for that.

Posted Jul 27, 2021 8:13 UTC (Tue) by taladar (subscriber, #68407)

Closing a file descriptor from another process would also circumvent any static analysis a language (like Haskell or Rust) might have done to ensure certain operations are only done on open file descriptors.

Posted Jul 27, 2021 15:03 UTC (Tue) by mathstuf (subscriber, #69389)

Meddling from "outside" is likely to interfere with guarantees made by any language. I think once you introduce an "interloper" into your process space (either via ptrace, VM managers, OOM killer, interrupts, etc.), you're playing with fire. Sure, we know how to manage it most of the time and can keep it contained, but if it gets loose…well, I hope you have insurance[1].

[1] In code, that would be "sanity checks on top of the language guarantees". IMO, it's just normal defensive coding and the amount you put in depends on how paranoid you tend (or need) to be.

Posted Jan 9, 2020 21:23 UTC (Thu) by roc (subscriber, #30627)

This sounds great. We have code to do this in rr already:
https://github.com/mozilla/rr/blob/79eea40fe0d496abb6fcb0...
It's not nice, especially because we want it to work whether the tracee is 64-bit or 32-bit.

Of course it will be years before the new syscall is widely deployed enough that we can actually rip out our code, but ... progress.

Posted Jan 10, 2020 20:53 UTC (Fri) by kylebot (guest, #134772)

If I remember correctly, one process can already send file descriptors to another through the sendmsg() system call.
What, then, is the difference between these two methods?

Posted Jan 10, 2020 22:30 UTC (Fri) by cyphar (subscriber, #110703)

Using sendmsg(2) requires co-operation from the other side (or the injection of parasitic code à la CRIU or rr). Those approaches are really suboptimal for a bunch of reasons, and having an interface which does this properly and doesn't require shellcode injection as part of normal code execution is a massive benefit. Not to mention that seccomp filters on the target process may block some of the syscalls needed for that to work.

Posted Jan 14, 2020 14:52 UTC (Tue) by dona73110 (guest, #113155)

> One thing that is possible in current kernels is to open a file that another process also has open; the information needed to do that is in each process's /proc directory. That does not work, though, for file descriptors referring to pipes, sockets, or other objects that do not appear in the filesystem hierarchy.

You certainly can open a pipe that another process has open, by opening /proc/PID/fd/FD. open(2) opens the actual files that these symlinks represent, which, in the case of deleted files, pipes, and so on, do not correspond to the path in the symlink target returned by readlink.

Posted Jan 29, 2020 1:35 UTC (Wed) by cyphar (subscriber, #110703)

You're right about how magic-links work (and re-opening through /proc/$pid/fd does work for pipes), but this does not work for sockets or anonfds -- you'll get -ENXIO when you try to re-open them. Additionally, there is still a pid recycling race condition if you use procfs (unless you have a first-generation /proc/$pid-style pidfd).