A crop of new capabilities
New capabilities in 5.8
The first of the new capabilities is CAP_PERFMON, which was covered in detail here last February. With this capability, a user can do performance monitoring, attach BPF programs to tracepoints, and carry out other related actions. In current kernels, the catch-all CAP_SYS_ADMIN capability is required for this sort of performance monitoring; going forward, users can be given more restricted access. Of course, a process with CAP_SYS_ADMIN will still be able to do performance monitoring as well; it would be nice to remove that power from CAP_SYS_ADMIN, but doing so would likely break existing systems.
The other new capability, CAP_BPF, controls many of the actions that can be carried out with the bpf() system call. This capability has been the subject of a number of long and intense conversations over the last year; see this thread or this one for examples. The original idea was to provide a special device called /dev/bpf that would control access to BPF functionality, but that proposal did not get far. What was being provided was, in essence, a new capability, so capabilities seemed like a better solution.
The current CAP_BPF controls a number of BPF-specific operations, including the creation of BPF maps, use of a number of advanced BPF program features (bounded loops, cross-program function calls, etc.), access to BPF type format (BTF) data, and more. While the original plan was to not retain backward compatibility for processes holding CAP_SYS_ADMIN in an attempt to avoid what Alexei Starovoitov described as the "deprecated mess", the code that was actually merged does still recognize CAP_SYS_ADMIN.
One interesting aspect of CAP_BPF is that, on its own, it does not confer the power to do much that is useful. Crucially, it is still not possible to load most types of BPF programs with just CAP_BPF; to do that, a process must hold other capabilities relevant to the subsystem of interest. For example, programs for tracepoints, kprobes, or perf events can only be loaded if the process also holds CAP_PERFMON. Most program types related to networking (packet classifiers, XDP programs, etc.) require CAP_NET_ADMIN. If a user wants to load a program for a networking function that calls bpf_trace_printk(), then both CAP_NET_ADMIN and CAP_PERFMON are required. It is thus the combination of CAP_BPF with other capabilities that grants the ability to use BPF in specific ways.
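The way these capabilities combine can be sketched in a few lines of Python. This is only an illustration of the rules described above, not the kernel's logic: the grouping into "tracing" and "networking" categories and the function names are invented for this example, and the kernel's real checks are made per program type and per helper. The capability bit numbers are the ones defined in the kernel's capability header.

```python
# Capability bit numbers as defined in <linux/capability.h>.
CAP_NET_ADMIN = 12
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38
CAP_BPF = 39

# Hypothetical grouping, for illustration only: the kernel checks
# individual program types and helpers, not these coarse categories.
REQUIRED = {
    "tracing": {CAP_BPF, CAP_PERFMON},
    "networking": {CAP_BPF, CAP_NET_ADMIN},
    "networking+trace_printk": {CAP_BPF, CAP_NET_ADMIN, CAP_PERFMON},
}


def caps_from_mask(mask):
    """Turn a capability bitmask into the set of capability numbers it holds."""
    return {bit for bit in range(mask.bit_length()) if (mask >> bit) & 1}


def may_load(kind, effective_mask):
    """Simplified check: is this effective set enough for this kind of program?"""
    held = caps_from_mask(effective_mask)
    if CAP_SYS_ADMIN in held:
        # The merged code still recognizes CAP_SYS_ADMIN as sufficient.
        return True
    return REQUIRED[kind] <= held


def current_effective_mask():
    """Read the calling process's effective capabilities from /proc."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    raise RuntimeError("CapEff line not found")
```

Calling may_load("tracing", current_effective_mask()) on a Linux system gives a rough idea of whether the current process would pass this simplified check; CAP_BPF alone is never enough in this model, which is the point made above.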
Additionally, some BPF operations still require CAP_SYS_ADMIN. Offloading BPF programs into hardware is one example. Another one is iterating through BPF objects — programs, maps, etc. — to see what is loaded in the system. The ability to look up a map, for example, would give a process the ability to change maps belonging to other users and with it, the potential for all sorts of mayhem. Thus the bar for such activities is higher.
The end result of this work is that it will be possible to do quite a bit of network administration, performance monitoring, and tracing work without full root (or even full CAP_SYS_ADMIN) access.
CAP_RESTORE
The CAP_RESTORE proposal was posted in late May; its purpose is to allow the checkpointing and restoring of processes by (otherwise) unprivileged processes. Patch author Adrian Reber wrote that this is nearly possible today using the checkpoint/restore in user space (CRIU) feature that has been under development for many years. There are a few remaining obstacles, though, one of which is process IDs. Ideally, a process could be checkpointed and restored, perhaps on a different system, without even noticing that anything had happened. If the process's ID changes, though, that could be surprising and could lead to bad results. So the CRIU developers would like the ability to restore a process using the same ID (or IDs for a multi-threaded process) it had when it was checkpointed, assuming that the desired IDs are available, of course.
Setting the ID of a new process is possible with clone3(), but this feature is not available to unprivileged processes. The ability to create processes with a chosen ID would make a number of attacks easier, so ID setting is restricted to processes with, of course, CAP_SYS_ADMIN. Administrators tend to balk at handing out that capability, so CRIU users have been resorting to a number of workarounds; Reber listed a few that vary from the reasonable to the appalling:
- Containers that can be put into user namespaces can, of course, control process IDs within their namespaces without any particular difficulty. But that is evidently not a solution for everybody.
- Some high-performance computing users run CRIU by way of a setuid wrapper to gain the needed capabilities.
- Some users run the equivalent of a fork bomb, quickly creating (and killing) processes to cycle through the process-ID space up to the desired value.
- Java virtual-machine developers would evidently like to use CRIU to short out their long startup times; they have been simply patching out the CAP_SYS_ADMIN checks in their kernel (a workaround that led Casey Schaufler to exclaim: "That's not a workaround, it's a policy violation. Bad JVM! No biscuit!").
Reber reasonably suggested that it should be possible to find a better solution than those listed above, and said that CAP_RESTORE would be a good fit.
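To get a feel for why the PID-cycling workaround in the list above is unattractive, here is a rough estimate of how many processes must be created before a given ID comes around. It is a simplified model, not a description of the kernel's allocator: it assumes IDs are handed out sequentially, that none of the intermediate IDs are in use (IDs in use are skipped, so the real count can be lower), and that allocation restarts at 300 after wrapping. The function name and the default values of pid_max and wrap_start are assumptions chosen for illustration.

```python
def forks_to_reach(target_pid, last_pid, pid_max=32768, wrap_start=300):
    """Estimate the process creations needed before target_pid is handed out.

    Assumes strictly sequential allocation: the next ID is last_pid + 1,
    and after pid_max - 1 the allocator restarts at wrap_start.
    """
    if not 0 < target_pid < pid_max:
        raise ValueError("target_pid must be between 1 and pid_max - 1")
    if target_pid > last_pid:
        # No wraparound needed: just count up to the target.
        return target_pid - last_pid
    if target_pid < wrap_start:
        raise ValueError("target_pid is below wrap_start and cannot be reached")
    # Run to the end of the ID space, then count up from wrap_start.
    return (pid_max - 1 - last_pid) + (target_pid - wrap_start + 1)
```

With a large pid_max, as is common on 64-bit systems, restoring a process whose ID is just below the most recently allocated one means creating and reaping nearly pid_max short-lived processes.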
Discussion of this patch focused on a couple of issues, starting with whether it was needed at all. Schaufler, in particular, wanted to know what the new capability would buy, and whether it would truly be sufficient to carry out the checkpoint and restore operations without still needing CAP_SYS_ADMIN. Just splitting something out of CAP_SYS_ADMIN, he said, is not useful by itself:
It does seem that CAP_RESTORE may, in the end, be sufficient for this task, though, so Schaufler's objections seemed to fade over time.
The other question that came up was: what other actions would eventually be made possible with this new capability? The patch hinted at others, but they were not implemented. The main one appears to be the ability to read the entries in /proc/pid/map_files in order to be able to properly dump out various mappings during the checkpoint procedure. The next version of the patch will have an implementation of this behavior as well. Some developers wondered whether there should be two new capabilities, with the second being CAP_CHECKPOINT, to cover the actions specific to each procedure; that change may not happen without further pressure, though.
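As an illustration of what reading /proc/pid/map_files involves, the following sketch lists the entries for a process. Each entry is a symbolic link named after the hexadecimal start and end addresses of a mapping, pointing at the mapped file. The function names are invented for this example; reading the links may be refused for a process lacking the needed privilege, which the sketch records as None instead of failing.

```python
import os


def parse_range(name):
    """Split a map_files entry name like '400000-41a000' into two integers."""
    start, end = name.split("-")
    return int(start, 16), int(end, 16)


def dump_map_files(pid="self"):
    """Return (start, end, target) tuples for each file-backed mapping.

    target is None when the link cannot be read, for example because the
    caller lacks the required privilege.
    """
    directory = f"/proc/{pid}/map_files"
    result = []
    for name in sorted(os.listdir(directory), key=parse_range):
        start, end = parse_range(name)
        try:
            target = os.readlink(os.path.join(directory, name))
        except OSError:
            target = None
        result.append((start, end, target))
    return result
```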
The final form of this patch remains to be seen; security-related changes
can require a lot of discussion and modification before they find their way
in. But this capability seems useful enough that it will probably end up
merged in some form at some point.
Index entries for this article | |
---|---|
Kernel | BPF/Security |
Kernel | Capabilities |
Kernel | Checkpointing |
Security | Capabilities |
Security | Linux kernel/Linux/POSIX capabilities |
Posted Jun 8, 2020 20:38 UTC (Mon)
by roc (subscriber, #30627)
When it's OK to run the migrated process in its own namespaces, i.e. a container, there isn't really a problem because the migrated process can have (or at least start with) CAP_SYS_ADMIN in those namespaces.
Posted Jun 8, 2020 23:37 UTC (Mon)
by NYKevin (subscriber, #129325)
Wait, what???
Are people actually doing that in production, or is it one of those "gee willikers, look what I can do" GitHub repositories?
Posted Jun 9, 2020 19:28 UTC (Tue)
by kevincox (guest, #93938)
Posted Jun 11, 2020 16:17 UTC (Thu)
by Jandar (subscriber, #85683)
Do you have to make many guesses about what I do when starting the screen? ;-)
Posted Jun 11, 2020 18:13 UTC (Thu)
by mathstuf (subscriber, #69389)
This probably works much better when pids are small (when uptime is low). Now to get such a pid after you already have pid 1000000 spawned, your process comet probably takes a non-negligible amount of time. And then you're stuck with another million or so pids until it is unique again.
Posted Jun 11, 2020 20:41 UTC (Thu)
by Jandar (subscriber, #85683)
The uptime doesn't factor in. The production systems of our customers mostly have a max-pid of 64k with a high uptime. The pid wraps around after a few hours to a few days. So if someone else starts a screen from the shared admin-account, it gets a pid nearly at random. The pids starting with 1 to 5 each come from a set of 11111 numbers (disregarding special pids like 1) and pids starting with 7 to 9 each come from a set of 1111.
while [[ $(readlink /proc/self) != 9* ]]; do :; done && screen -S Jandar
Since normally at most a handful of screens are running, the chance of having 9* exclusively for myself is high, because no workmate uses the shell-prompt to type in loops. Probability with 5 other screens = (1-1111/2^16)^5 = .91807.
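The counts and the probability quoted above can be checked with a short Python sketch. The function names are made up for this example, and the model is the same simple one used in the comment: each other screen gets a pid uniformly at random out of 2^16 values.

```python
def pids_with_prefix(prefix, pid_max=65536):
    """Count pids in 1..pid_max-1 whose decimal form starts with prefix."""
    return sum(1 for pid in range(1, pid_max) if str(pid).startswith(prefix))


def chance_prefix_unshared(prefix, others, pid_max=65536):
    """Probability that none of `others` random pids starts with prefix."""
    share = pids_with_prefix(prefix, pid_max) / pid_max
    return (1 - share) ** others


print(pids_with_prefix("1"), pids_with_prefix("9"))
print(chance_prefix_unshared("9", 5))
```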
Posted Jun 11, 2020 23:24 UTC (Thu)
by kmweber (guest, #114635)
You can do that with screen, too.
Posted Jun 12, 2020 2:24 UTC (Fri)
by Jandar (subscriber, #85683)
I was going to say: it doesn't work that way, but I tested it before commenting. In reality it does work.
The only explanation I have is, that the manual shows contradicting usages.
1) from the Synopsis:
I have never tested if screen -r sessionname works, I've always assumed pid is mandatory. *facepalm*
Posted Jun 16, 2020 13:55 UTC (Tue)
by geert (subscriber, #98403)
There is one caveat though: if a session name is an abbreviation of another session name, and the session with the shorter name doesn't exist yet, then screen will happily connect to the (wrong) session with the longer name instead. Once the session has been created, everything works as expected, though.
Posted Jun 9, 2020 0:15 UTC (Tue)
by nickodell (subscriber, #125165)
I don't get what the issue is with using a PID namespace. It seems like it fixes both the permissions issue and the collision issue.
Posted Jun 9, 2020 0:24 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
Posted Jun 10, 2020 4:05 UTC (Wed)
by NYKevin (subscriber, #129325)
I mean, that's almost as sensible as some of the other suggestions in the article, right?
Posted Jun 11, 2020 2:55 UTC (Thu)
by hendry (guest, #50859)
Are any distros leveraging this?
Posted Jun 11, 2020 7:09 UTC (Thu)
by zdzichu (subscriber, #17118)
%attr(0755,root,root) %caps(cap_net_raw=p) %{_bindir}/arping
This way the "arping" command can be run without special privileges. Some more information: https://fedoraproject.org/wiki/Features/RemoveSETUID
screen -r [[pid.]tty[.host]]
2) from the section COMMAND-LINE OPTIONS:
-r [pid.tty.host]
-r sessionowner/[pid.tty.host]
In 2) the pid seems to be not optional and the sessionname from -S sessionname
only substitutes tty.host. The text description for -r says "prefix of [pid.]tty.host" so pid seems optional.
Solution: create your sessions in the right order, so they never match an existing one.
So what are you supposed to do if some other process starts using that PID?
This looks like the following in an RPM .spec file: