A crop of new capabilities

By Jonathan Corbet
June 8, 2020

Linux capabilities empower the holder to perform a set of specific privileged operations while withholding the full power of root access; see the capabilities man page for a list of current capabilities and what they control. There have been no capabilities added to the kernel since CAP_AUDIT_READ was merged for 3.16 in 2014. That's about to change with the 5.8 release, though, which is set to contain two new capabilities; yet another is currently under development.

New capabilities in 5.8

The first of the new capabilities is CAP_PERFMON, which was covered in detail here last February. With this capability, a user can perform performance monitoring, attach BPF programs to tracepoints, and carry out other related actions. In current kernels, the catch-all CAP_SYS_ADMIN capability is required for this sort of performance monitoring; going forward, users can be given more restricted access. Of course, a process with CAP_SYS_ADMIN will still be able to do performance monitoring as well; it would be nice to remove that power from CAP_SYS_ADMIN, but doing so would likely break existing systems.

The other new capability, CAP_BPF, controls many of the actions that can be carried out with the bpf() system call. This capability has been the subject of a number of long and intense conversations over the last year; see this thread or this one for examples. The original idea was to provide a special device called /dev/bpf that would control access to BPF functionality, but that proposal did not get far. What was being provided was, in essence, a new capability, so capabilities seemed like a better solution.

The current CAP_BPF controls a number of BPF-specific operations, including the creation of BPF maps, use of a number of advanced BPF program features (bounded loops, cross-program function calls, etc.), access to BPF type format (BTF) data, and more. While the original plan was to not retain backward compatibility for processes holding CAP_SYS_ADMIN in an attempt to avoid what Alexei Starovoitov described as the "deprecated mess", the code that was actually merged does still recognize CAP_SYS_ADMIN.

One interesting aspect of CAP_BPF is that, on its own, it does not confer the power to do much that is useful. Crucially, it is still not possible to load most types of BPF programs with just CAP_BPF; to do that, a process must hold other capabilities relevant to the subsystem of interest. For example, programs for tracepoints, kprobes, or perf events can only be loaded if the process also holds CAP_PERFMON. Most program types related to networking (packet classifiers, XDP programs, etc.) require CAP_NET_ADMIN. If a user wants to load a program for a networking function that calls bpf_trace_printk(), then both CAP_NET_ADMIN and CAP_PERFMON are required. It is thus the combination of CAP_BPF with other capabilities that grants the ability to use BPF in specific ways.
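The combinations described above can be modeled in a few lines. The following Python sketch only illustrates the rules as this article states them; it is not the kernel's actual permission logic, and the function names and the "tracing"/"networking" labels are invented for the example. The capability bit numbers are the ones defined in include/uapi/linux/capability.h.

```python
# Capability bit numbers from include/uapi/linux/capability.h.
CAP_NET_ADMIN = 12
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38
CAP_BPF = 39


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    """True if the capability's bit is set in the mask."""
    return bool(mask & (1 << cap))


def can_load(mask, kind, uses_trace_printk=False):
    """Simplified model of the article's rules (hypothetical helper).

    kind is "tracing" (tracepoints, kprobes, perf events) or
    "networking" (packet classifiers, XDP, ...).
    """
    if has_cap(mask, CAP_SYS_ADMIN):
        return True  # CAP_SYS_ADMIN is still recognized
    if not has_cap(mask, CAP_BPF):
        return False
    if kind == "tracing":
        needed = {CAP_PERFMON}
    elif kind == "networking":
        needed = {CAP_NET_ADMIN}
    else:
        raise ValueError(f"unknown program kind: {kind}")
    if uses_trace_printk:
        needed.add(CAP_PERFMON)  # bpf_trace_printk() counts as tracing
    return all(has_cap(mask, cap) for cap in needed)
```

A process could feed its own mask in with parse_cap_eff(open("/proc/self/status").read()); a holder of only CAP_BPF and CAP_NET_ADMIN would then be told that a networking program is loadable but one calling bpf_trace_printk() is not.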

Additionally, some BPF operations still require CAP_SYS_ADMIN. Offloading BPF programs into hardware is one example. Another is iterating through BPF objects (programs, maps, etc.) to see what is loaded in the system. The ability to look up a map, for example, would give a process the ability to change maps belonging to other users and, with it, the potential for all sorts of mayhem. Thus the bar for such activities is higher.

The end result of this work is that it will be possible to do quite a bit of network administration, performance monitoring, and tracing work without full root (or even full CAP_SYS_ADMIN) access.

CAP_RESTORE

The CAP_RESTORE proposal was posted in late May; its purpose is to allow the checkpointing and restoring of processes by (otherwise) unprivileged processes. Patch author Adrian Reber wrote that this is nearly possible today using the checkpoint/restore in user space (CRIU) feature that has been under development for many years. There are a few remaining obstacles, though, one of which is process IDs. Ideally, a process could be checkpointed and restored, perhaps on a different system, without even noticing that anything had happened. If the process's ID changes, though, that could be surprising and could lead to bad results. So the CRIU developers would like the ability to restore a process using the same ID (or IDs for a multi-threaded process) it had when it was checkpointed, assuming that the desired IDs are available, of course.

Setting the ID of a new process is possible with clone3(), but this feature is not available to unprivileged processes. The ability to create processes with a chosen ID would make a number of attacks easier, so ID setting is restricted to processes with, of course, CAP_SYS_ADMIN. Administrators tend to balk at handing out that capability, so CRIU users have been resorting to a number of workarounds; Reber listed a few that vary from the reasonable to the appalling:
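As a rough illustration of the clone3() interface mentioned above, the following Python sketch fills in a struct clone_args with a requested process ID. It assumes the layout from <linux/sched.h> as of 5.7 (eleven 64-bit fields) and syscall number 435, which holds on x86-64 and most other architectures; the helper names are invented for this example, and raw system calls from Python are for demonstration only.

```python
import ctypes
import os
import signal

SYS_CLONE3 = 435  # assumption: x86-64 (and most other architectures)


class CloneArgs(ctypes.Structure):
    """Mirror of struct clone_args: eleven 64-bit fields, 88 bytes."""
    _fields_ = [(name, ctypes.c_uint64) for name in (
        "flags", "pidfd", "child_tid", "parent_tid", "exit_signal",
        "stack", "stack_size", "tls", "set_tid", "set_tid_size", "cgroup",
    )]


def build_clone_args(wanted_pid):
    """Return (args, tids); tids must stay alive while args is in use."""
    tids = (ctypes.c_int * 1)(wanted_pid)  # one entry: a single PID namespace
    args = CloneArgs()
    args.exit_signal = signal.SIGCHLD
    args.set_tid = ctypes.addressof(tids)
    args.set_tid_size = 1
    return args, tids


def fork_with_pid(wanted_pid):
    """Create a child with the requested PID; raises OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    args, tids = build_clone_args(wanted_pid)
    ret = libc.syscall(SYS_CLONE3, ctypes.byref(args),
                       ctypes.c_size_t(ctypes.sizeof(args)))
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    if ret == 0:
        os._exit(0)  # child: exit immediately in this demonstration
    os.waitpid(ret, 0)
    return ret
```

Without the needed privilege, fork_with_pid() should fail with EPERM; if the requested ID is already taken, the kernel reports EEXIST instead.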

Containers that can be put into user namespaces can, of course, control process IDs within their namespaces without any particular difficulty. But that is evidently not a solution for everybody.
Some high-performance computing users run CRIU by way of a setuid wrapper to gain the needed capabilities.
Some users run the equivalent of a fork bomb, quickly creating (and killing) processes to cycle through the process-ID space up to the desired value.
Java virtual-machine developers would evidently like to use CRIU to short out their long startup times; they have been simply patching out the CAP_SYS_ADMIN checks in their kernel (a workaround that led Casey Schaufler to exclaim: "That's not a workaround, it's a policy violation. Bad JVM! No biscuit!").

Reber reasonably suggested that it should be possible to find a better solution than those listed above, and said that CAP_RESTORE would be a good fit.

Discussion of this patch focused on a couple of issues, starting with whether it was needed at all. Schaufler, in particular, wanted to know what the new capability would buy, and whether it would truly be sufficient to carry out the checkpoint and restore operations without still needing CAP_SYS_ADMIN. Just splitting something out of CAP_SYS_ADMIN, he said, is not useful by itself:

If we broke out CAP_SYS_ADMIN properly we'd have hundreds of capabilities, and no one would be able to manage the capability sets on anything. Just breaking out of CAP_SYS_ADMIN, especially if the process is going to need other capabilities anyway, gains you nothing.

It does seem that CAP_RESTORE may, in the end, be sufficient for this task, though, so Schaufler's objections seemed to fade over time.

The other question that came up was: what other actions would eventually be made possible with this new capability? The patch hinted at others, but they were not implemented. The main one appears to be the ability to read the entries in /proc/pid/map_files in order to be able to properly dump out various mappings during the checkpoint procedure. The next version of the patch will have an implementation of this behavior as well. Some developers wondered whether there should be two new capabilities, with the second being CAP_CHECKPOINT, to cover the actions specific to each procedure; that change may not happen without further pressure, though.
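For readers unfamiliar with map_files: each entry in that directory is a symbolic link named after the hexadecimal address range of a file-backed mapping, and it is the reading of those links that needs privilege. The Python sketch below walks the directory and records which links could be read; the function names are invented for the example, and exactly which step fails for an unprivileged process depends on the kernel version.

```python
import os


def parse_map_range(name):
    """Split a map_files entry name like '7f00-8f00' into (start, end)."""
    start, end = name.split("-")
    return int(start, 16), int(end, 16)


def dump_map_files(pid="self"):
    """Return (start, end, target) tuples for a process's mappings.

    target is None when the link could not be read for lack of privilege.
    """
    directory = f"/proc/{pid}/map_files"
    results = []
    for name in sorted(os.listdir(directory)):
        start, end = parse_map_range(name)
        try:
            target = os.readlink(os.path.join(directory, name))
        except PermissionError:
            target = None  # this is the check a checkpoint tool runs into
        results.append((start, end, target))
    return results
```

A checkpointing tool needs the link targets to know which file backs each mapping, which is why this access matters for dumping a process.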

The final form of this patch remains to be seen; security-related changes can require a lot of discussion and modification before they find their way in. But this capability seems useful enough that it will probably end up merged in some form at some point.


A crop of new capabilities

Posted Jun 8, 2020 20:38 UTC (Mon) by roc (subscriber, #30627) [Link]

I wonder what the use-cases are for CRIU that require putting a migrated process in the same namespaces as other non-migrated processes.

When it's OK to run the migrated process in its own namespaces, i.e. a container, there isn't really a problem because the migrated process can have (or at least start with) CAP_SYS_ADMIN in those namespaces.

A crop of new capabilities

Posted Jun 8, 2020 23:37 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (7 responses)

> Some users run the equivalent of a fork bomb, quickly creating (and killing) processes to cycle through the process-ID space up to the desired value.

Wait, what???

Are people actually doing that in production, or is it one of those "gee willikers, look what I can do" GitHub repositories?

A crop of new capabilities

Posted Jun 9, 2020 19:28 UTC (Tue) by kevincox (guest, #93938) [Link]

At a minimum it seems like a reasonable demonstration that requiring special privileges to perform this is not an effective mitigation.

A crop of new capabilities

Posted Jun 11, 2020 16:17 UTC (Thu) by Jandar (subscriber, #85683) [Link] (5 responses)

I have long-running (multiple months) screen sessions on some production servers to which I connect frequently. To connect to a screen you have to specify the unique beginning of the screen's pid. Pids beginning with a digit higher than the leading digit of max-pid are 10 times rarer than those beginning with a lower digit. If I get a screen with pid 9* it's normally sufficient to type 'screen -r 9' to reconnect.

Do you have to make many guesses about what I do when starting the screen? ;-)

A crop of new capabilities

Posted Jun 11, 2020 18:13 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (4 responses)

A true hacker's hack. Or you could use tmux and then you get access to named sessions/sockets ;) .

This probably works much better when pids are small (when uptime is low). Now to get such a pid after you already have pid 1000000 spawned, your process comet probably takes a non-negligible amount of time. And then you're stuck with another million or so pids until it is unique again.

A crop of new capabilities

Posted Jun 11, 2020 20:41 UTC (Thu) by Jandar (subscriber, #85683) [Link]

> This probably works much better when pids are small (when uptime is low).

The uptime doesn't factor in. The production systems of our customers mostly have a max-pid of 64k with a high uptime. The pid wraps around after a few hours to a few days. So if someone else starts a screen from the shared admin-account it gets a pid nearly at random. The pids starting with each of the digits 1 to 5 come from a set of 11111 numbers (disregarding special pids like 1) and the pids starting with each of the digits 7 to 9 come from a set of 1111.

while [[ $(readlink /proc/self) != 9* ]]; do :; done && screen -S Jandar

As normally at most a handful of screens are running, the chance to have 9* exclusively for myself is high because no workmate uses the shell-prompt to type in loops. Probability with 5 other screens = (1-1111/2^16)^5 ≈ .918.
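The counts and the probability above are easy to check by brute force. This Python sketch assumes pids run from 1 to 65535 and reuses the comment's own formula (with 2^16 as the denominator); the function names are made up for the illustration.

```python
from collections import Counter


def leading_digit_counts(pid_max=2**16):
    """Count how many pids in 1..pid_max-1 start with each decimal digit."""
    return Counter(str(pid)[0] for pid in range(1, pid_max))


def chance_prefix_unused(prefix_count, others, pid_max=2**16):
    """Chance that none of `others` randomly placed pids share the prefix."""
    return (1 - prefix_count / pid_max) ** others


counts = leading_digit_counts()
print(counts["1"], counts["9"])            # sizes of the 1* and 9* sets
print(chance_prefix_unused(counts["9"], 5))
```

The digit 6 sits in between the two groups, since only part of the 60000-69999 range fits below 2^16.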

A crop of new capabilities

Posted Jun 11, 2020 23:24 UTC (Thu) by kmweber (guest, #114635) [Link] (2 responses)

> Or you could use tmux and then you get access to named sessions/sockets

You can do that with screen, too.

A crop of new capabilities

Posted Jun 12, 2020 2:24 UTC (Fri) by Jandar (subscriber, #85683) [Link] (1 responses)

> You can do that with screen, too.

I was going to say: it doesn't work that way, but I tested it before commenting. In reality it does work.

The only explanation I have is that the manual shows contradictory usages.

1) from the Synopsis:
screen -r [[pid.]tty[.host]]
2) from the section COMMAND-LINE OPTIONS:
-r [pid.tty.host]
-r sessionowner/[pid.tty.host]
In 2) the pid does not seem to be optional, and the sessionname from -S sessionname only substitutes tty.host. The text description for -r says "prefix of [pid.]tty.host", so the pid seems optional.

I have never tested if screen -r sessionname works, I've always assumed pid is mandatory. *facepalm*

A crop of new capabilities

Posted Jun 16, 2020 13:55 UTC (Tue) by geert (subscriber, #98403) [Link]

I use "screen -dRR -S <sessionname> ..." all the time, from a script that knows how to connect to whatever target board I specify.

There is one caveat though: if a session name is an abbreviation of another session name, and the session with the shorter name doesn't exist yet, then screen will happily connect to the (wrong) session with the longer name instead. Once the session has been created, everything works as expected, though.
Solution: create your sessions in the right order, so they never match an existing one.

A crop of new capabilities

Posted Jun 9, 2020 0:15 UTC (Tue) by nickodell (subscriber, #125165) [Link] (2 responses)

>Setting the ID of a new process is possible with clone3(), but this feature is not available to unprivileged processes. The ability to create processes with a chosen ID would make a number of attacks easier, so ID setting is restricted to processes with, of course, CAP_SYS_ADMIN. Administrators tend to balk at handing out that capability, so CRIU users have been resorting to a number of workarounds; Reber listed a few that vary from the reasonable to the appalling:
So what are you supposed to do if some other process starts using that PID?

I don't get what the issue is with using a PID namespace. It seems like it fixes both the permissions issue and the collision issue.

A crop of new capabilities

Posted Jun 9, 2020 0:24 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Each PID namespace starts PIDs from 1. So this shouldn't be an issue.

A crop of new capabilities

Posted Jun 10, 2020 4:05 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

Obviously, you're supposed to send them a SIGSEGV with kill(2), and hope whoever's responsible for that binary never figures out their impossible-to-reproduce segfault bug.

I mean, that's almost as sensible as some of the other suggestions in the article, right?

A crop of new capabilities

Posted Jun 11, 2020 2:55 UTC (Thu) by hendry (guest, #50859) [Link] (1 responses)

I don't quite understand how binaries are distributed by Linux distributions with these capabilities.

Are any distros leveraging this?

A crop of new capabilities

Posted Jun 11, 2020 7:09 UTC (Thu) by zdzichu (subscriber, #17118) [Link]

For example, RPM packages define metadata about files shipped in the package. Owner, permission, ACL, xattrs, capabilities etc. When the package is installed, all those attributes are set to match.
This looks like the following in an RPM .spec file:

%attr(0755,root,root) %caps(cap_net_raw=p) %{_bindir}/arping

This way “arping” command can be run without special privileges. Some more information: https://fedoraproject.org/wiki/Features/RemoveSETUID
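On disk, a %caps() entry like the one above ends up in the file's security.capability extended attribute. As a rough sketch of what is stored there, the following Python decodes the revision 2 and 3 layouts of struct vfs_cap_data from include/uapi/linux/capability.h; the function name and the returned dictionary are choices made for this example only.

```python
import struct

VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
CAP_NET_RAW = 13


def decode_file_caps(raw):
    """Decode the bytes of a security.capability xattr (hypothetical helper)."""
    # Little-endian: magic, then (permitted, inheritable) for the low and
    # high 32 bits of the capability sets.
    magic, p_lo, i_lo, p_hi, i_hi = struct.unpack_from("<5I", raw)
    revision = magic & VFS_CAP_REVISION_MASK
    if revision not in (VFS_CAP_REVISION_2, VFS_CAP_REVISION_3):
        raise ValueError(f"unsupported revision: {revision:#x}")
    return {
        "permitted": p_lo | (p_hi << 32),
        "inheritable": i_lo | (i_hi << 32),
        "effective": bool(magic & VFS_CAP_FLAGS_EFFECTIVE),
    }
```

On a system where arping was installed this way, something like decode_file_caps(os.getxattr("/usr/bin/arping", "security.capability")) should show bit 13 (CAP_NET_RAW) in the permitted set; the path is an assumption and varies between distributions.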