Deferring seccomp decisions to user space
Seccomp, in its most flexible mode, allows user space to load a BPF program (still "classic" BPF, not the newer "extended" BPF) that has the opportunity to review every system call made by the controlled process. This program can choose to allow a call to proceed, or it can intervene by forcing a failure return or the immediate death of the process. These seccomp filters are known to be challenging to write for a number of reasons, even when the desired policy is simple.
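To give a sense of what even a simple policy looks like in classic BPF, here is a minimal sketch, not taken from the patch set, of a filter that fails one system call with an error code and allows everything else. The helper name build_errno_filter() and its parameters are invented for illustration; the BPF_STMT()/BPF_JUMP() macros and the SECCOMP_RET_* values come from the kernel's user-space headers.

```c
#include <stddef.h>
#include <string.h>
#include <linux/audit.h>    /* AUDIT_ARCH_* values for the arch argument */
#include <linux/filter.h>   /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */

/*
 * Hypothetical helper: write a classic-BPF program into out[] that
 * kills on an unexpected architecture, makes system call `nr` fail
 * with errno `err`, and allows every other call.
 * Returns the number of instructions, or 0 if `cap` is too small.
 */
size_t build_errno_filter(struct sock_filter *out, size_t cap,
                          __u32 arch, __u32 nr, __u32 err)
{
    const struct sock_filter insns[] = {
        /* [0] Load the architecture field of seccomp_data. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        /* [1] Expected architecture: skip the kill instruction. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, arch, 1, 0),
        /* [2] Anything else: kill. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* [3] Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* [4] Matching call falls through; others skip to ALLOW. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
        /* [5] Force a failure return carrying `err`. */
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA)),
        /* [6] Let the call proceed. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    size_t count = sizeof(insns) / sizeof(insns[0]);

    if (cap < count)
        return 0;
    memcpy(out, insns, sizeof(insns));
    return count;
}
```

A program built this way would normally be wrapped in a struct sock_fprog and installed with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) after setting PR_SET_NO_NEW_PRIVS. Even this one-rule policy needs hand-counted jump offsets, which hints at why more involved filters are hard to get right.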
Tycho Andersen, the author of the "seccomp trap to user space" patch set, sees a number of situations where the current mechanism falls short. His scenarios include allowing a container to load modules, create device nodes, or mount filesystems — with rigid controls applied. For example, creation of a /dev/null device would be allowed, but new block devices (or almost anything else) would not. Policies to allow this kind of action can be complex and site-specific; they are not something that would be easily implemented in a BPF program. But it might be possible to write something in user space that could handle decisions like these.
To enable this, Andersen's patch set adds a new return type for BPF programs (SECCOMP_RET_USER_NOTIF) that will cause the program making the call to be blocked while information about the call is sent to user space. A controlling program wanting to receive these notifications (and make decisions) must open a file descriptor by setting the SECCOMP_FILTER_FLAG_GET_LISTENER flag when loading the filter program. The returned file descriptor can then be polled for events; reading from it will return the next available notification signaled by the BPF filter.
Notifications, when read, are encoded in this structure:
    struct seccomp_notif {
        __u64 id;
        pid_t pid;
        struct seccomp_data data;
    };
The returned id is a unique number identifying this event, pid is the ID of the process that triggered the notification, and data is the seccomp_data structure that was given to the BPF program describing the system call in progress:
    struct seccomp_data {
        int nr;                     /* System call number */
        __u32 arch;                 /* AUDIT_ARCH_* value (see <linux/audit.h>) */
        __u64 instruction_pointer;  /* CPU instruction pointer */
        __u64 args[6];              /* Up to 6 system call arguments */
    };
The user-space program can then meditate on whatever it is that the controlled program wishes to do. Note that the behavior of user notifications is similar to SECCOMP_RET_ERRNO, in that the system call itself will not be invoked in the context of the controlled process. So if the controlling process wants the system call to run in some form, it must do the work in its own context. When it has reached a decision (and done any needed work), it communicates that back to the kernel by filling in a seccomp_notif_resp structure and writing it back to the notification file descriptor:
    struct seccomp_notif_resp {
        __u64 id;
        __s32 error;
        __s64 val;
    };
The id value must match that found in the original notification. error should be either zero or a negative error code; in the latter case, it will be negated and used as an error return from the system call that created the notification in the first place. If error is zero, then that system call will return successfully with val as its return value.
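Putting the pieces together, one round of the read/decide/write exchange described above might look roughly like the sketch below. It is only an approximation under stated assumptions: the structures are local copies of the layouts quoted in this article (headers from a merged kernel may differ), and policy_fn, make_response(), handle_one_notification(), deny_all(), and report_success() are names made up for the example.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copies of the layouts quoted in the article (assumption). */
struct seccomp_data {
    int nr;
    uint32_t arch;
    uint64_t instruction_pointer;
    uint64_t args[6];
};

struct seccomp_notif {
    uint64_t id;
    pid_t pid;
    struct seccomp_data data;
};

struct seccomp_notif_resp {
    uint64_t id;
    int32_t error;
    int64_t val;
};

/* Hypothetical policy callback: returns 0 and sets *val on success,
 * or a negative error code to make the system call fail. */
typedef int (*policy_fn)(const struct seccomp_notif *req, int64_t *val);

/* Fill in a response for one notification according to the policy. */
void make_response(const struct seccomp_notif *req, policy_fn policy,
                   struct seccomp_notif_resp *resp)
{
    int64_t val = 0;
    int err = policy(req, &val);

    memset(resp, 0, sizeof(*resp));
    resp->id = req->id;            /* must match the notification */
    resp->error = err;             /* zero or a negative error code */
    resp->val = err ? 0 : val;     /* return value when error is zero */
}

/* Read one notification from the listener descriptor and answer it.
 * Returns 0 on success, -1 on a short read or write. */
int handle_one_notification(int listener, policy_fn policy)
{
    struct seccomp_notif req;
    struct seccomp_notif_resp resp;

    if (read(listener, &req, sizeof(req)) != (ssize_t)sizeof(req))
        return -1;
    make_response(&req, policy, &resp);
    if (write(listener, &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
        return -1;
    return 0;
}

/* Example policies: refuse everything, or report success with 0. */
int deny_all(const struct seccomp_notif *req, int64_t *val)
{
    (void)req;
    (void)val;
    return -EPERM;
}

int report_success(const struct seccomp_notif *req, int64_t *val)
{
    (void)req;
    *val = 0;
    return 0;
}
```

A controlling program would call something like handle_one_notification() whenever the listener descriptor polls as readable; any work that stands in for the blocked system call would have to happen inside the policy callback, in the controller's own context.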
As a somewhat experimental addition, the final patch in the series adds two fields to the seccomp_notif_resp structure:
    __u8 return_fd;
    __u32 fd;
These fields allow the control program to provide a file descriptor to be used as the return value from the system call; if return_fd is nonzero, fd will be passed to the controlled program. As Andersen notes, this mechanism will only work for system calls that are expected to return a file descriptor in the first place, but it's a starting point.
The protocol for the communication between the kernel and the control program has been the topic of some discussion in the past; in its current form, it will be difficult to extend when new features are (inevitably) added. Reviewers in the past have suggested using the netlink protocol instead, but that involves more complexity than the current implementation. Whether those reviewers will insist on that change before this code can be merged remains to be seen.
Overall, this patch series is another step in an interesting set of changes that has been taking place. The boundary between the kernel and user space was once a hard and well-defined line described by the system-call interface. Increasingly, developers are working to make it possible for users to move functionality across that line in both directions, either putting policy into the kernel with BPF programs or moving it out with various types of user-space helpers. As the computing environment changes, it seems that this flexibility will be needed to ensure that Linux stays relevant.
Index entries for this article | |
---|---|
Kernel | Security/seccomp |
Security | Linux kernel/Seccomp |
Posted Jun 2, 2018 12:54 UTC (Sat)
by TheJH (subscriber, #101155)
[Link] (1 responses)
Posted Jun 2, 2018 15:22 UTC (Sat)
by corbet (editor, #1)
[Link]
Posted Jun 2, 2018 13:04 UTC (Sat)
by brauner (subscriber, #109349)
[Link] (3 responses)
Posted Jun 3, 2018 13:11 UTC (Sun)
by jhoblitt (subscriber, #77733)
[Link] (2 responses)
Posted Sep 14, 2018 13:32 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
To not make this just a snark, I'll add a data point. I've seen git-lfs not want to fork out to Git for things like `remote get-url` and rather re-implement `insteadOf` and `pushInsteadOf` yet again. And so git-lfs is still broken with alias remote URLs that differ in push and pull. Attribute reading is also broken in the case of user "[attr]" attributes. Yes, both have issues filed (and I don't know Go (yet?) well enough to fix it myself).
I believe the *only* thing they fork for is to find out the version of Git used elsewhere. There might be one or two more instances as well, but they're of a similar level of actual functionality sharing.
Posted Sep 14, 2018 14:46 UTC (Fri)
by zlynx (guest, #2285)
[Link]
I do wish that the netlink formats were better documented.
Calling C code from Go causes all sorts of complex interactions with the green threads and garbage collection so it is not a good idea to casually link into CGo.
Posted Jun 2, 2018 16:42 UTC (Sat)
by skx (subscriber, #14652)
[Link] (1 responses)
I have to say I'm interested in seeing how this turns out - at least partially because I wrote a linux-security-module which defers checks for exec calls to user-space. The code is reasonably clean, and the overhead of having to exec a user-space binary is essentially unnoticed. The code is here: BPF has so many uses, and I'm loving the way it is becoming better documented, and more useful. I'm sure it is only a matter of time before it is invoked by linux-security modules.
Posted Jun 3, 2018 22:47 UTC (Sun)
by oscode (guest, #82250)
[Link]
Posted Jun 2, 2018 17:27 UTC (Sat)
by rvolgers (guest, #63218)
[Link] (1 responses)
I really wish we'd one day get a clean file descriptor based debugging API instead of the ptrace pseudo-reparenting and signal abuse nonsense.
Posted Jun 5, 2018 0:07 UTC (Tue)
by SEJeff (guest, #51588)
[Link]
Some light reading:
https://lwn.net/Articles/371210/
Posted Jun 2, 2018 19:14 UTC (Sat)
by smurf (subscriber, #17840)
[Link] (1 responses)
Posted Jun 2, 2018 20:07 UTC (Sat)
by TheJH (subscriber, #101155)
[Link]
Posted Jun 3, 2018 0:47 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
Perhaps a special case for strings could be added?
Posted Jun 3, 2018 1:49 UTC (Sun)
by dezgeg (subscriber, #92243)
[Link] (3 responses)
Posted Jun 3, 2018 3:11 UTC (Sun)
by roc (subscriber, #30627)
[Link]
Posted Jun 3, 2018 5:16 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Jun 3, 2018 13:21 UTC (Sun)
by jhoblitt (subscriber, #77733)
[Link]
I suspect that `statfs()` and `access()` are also frequent syscalls with string params. File distribution programs, such as HTTP servers, can produce a fairly extreme number of `access()` calls.
Posted Jun 4, 2018 21:37 UTC (Mon)
by wahern (subscriber, #37304)
[Link] (3 responses)
The solution is to copy the path or otherwise make it immutable. That's costly, and it's why the seccomp BPF filter originally didn't support processing the file path string. Has that changed?
Posted Jun 4, 2018 21:41 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
The helper code then can do all the required open/access/stat stuff and return the results as a file descriptor (open) or a static block of data (stat/access).
Obviously, copying the parameters will add some overhead, but it should be way less than doing additional ptrace/read_mem calls from the userspace helper.
Posted Jun 5, 2018 19:26 UTC (Tue)
by wahern (subscriber, #37304)
[Link] (1 responses)
Posted Jun 5, 2018 21:34 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jun 3, 2018 22:57 UTC (Sun)
by josh (subscriber, #17465)
[Link]
Deferring decisions to userspace?
That is a good point, something I didn't mention properly in the article. It behaves a lot like SECCOMP_RET_ERRNO. I have added a little text to try to fill that in, thanks.
Deferring seccomp decisions to user space
at Plumbers last year it has seen rapid development thanks to Tycho. No one has really done a lot of bikeshedding on it, which is great!
It seems that people haven't really noticed how many use cases this will enable once it is merged. If I were one of the gVisor guys I'd take a very close look at this patch set and whether it would be possible to kick out ptrace.
It's excellent that we've managed to decouple this from the eBPF seccomp patch set. The last step is to hopefully not tie this to netlink, as that looks like a lot of protocol for not much gain in this case. But we'll see.
https://yarchive.net/comp/linux/utrace.html
File paths?