Deferring seccomp decisions to user space

By Jonathan Corbet
June 2, 2018

There has been a lot of work in recent years to use BPF to push policy decisions into the kernel. But sometimes, it seems, what is really wanted is a way for a BPF program to punt a decision back to user space. That is the objective behind this patch set giving the secure computing (seccomp) mechanism a way to pass complex decisions to a user-space helper program.

Seccomp, in its most flexible mode, allows user space to load a BPF program (still "classic" BPF, not the newer "extended" BPF) that has the opportunity to review every system call made by the controlled process. This program can choose to allow a call to proceed, or it can intervene by forcing a failure return or the immediate death of the process. These seccomp filters are known to be challenging to write for a number of reasons, even when the desired policy is simple.
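For readers who have not seen one, a classic-BPF seccomp filter is an array of struct sock_filter instructions that inspects the seccomp_data structure and returns a verdict. What follows is a minimal sketch, not code from the patch set: build_deny_filter() and install_filter() are hypothetical helper names, and the policy (fail one system call with EPERM, allow everything else) was picked only for illustration. It assumes that the kernel UAPI headers <linux/filter.h> and <linux/seccomp.h> are installed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

#define DENY_FILTER_LEN 7

/*
 * Hypothetical helper: fill 'out' with a filter that kills callers from an
 * unexpected architecture, fails system call 'nr' with EPERM, and allows
 * everything else.
 */
void build_deny_filter(struct sock_filter out[DENY_FILTER_LEN],
                       uint32_t arch, uint32_t nr)
{
    const struct sock_filter prog[DENY_FILTER_LEN] = {
        /* [0] load seccomp_data.arch */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        /* [1] expected architecture? skip the kill at [2] */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, arch, 1, 0),
        /* [2] unexpected architecture */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* [3] load seccomp_data.nr */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* [4] the filtered call goes to [5], anything else to [6] */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
        /* [5] force a failure return of EPERM */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* [6] let the call proceed */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    assert(out != NULL);
    for (size_t i = 0; i < DENY_FILTER_LEN; i++)
        out[i] = prog[i];
}

/* Hypothetical helper: load the filter into the calling process. */
int install_filter(struct sock_filter *insns, unsigned short len)
{
    struct sock_fprog prog = { .len = len, .filter = insns };

    /* Unprivileged processes must set no_new_privs before loading a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A caller would typically pass an AUDIT_ARCH_* constant from <linux/audit.h> and a __NR_* system call number, then hand the array to install_filter(). Even this trivial policy needs hand-computed jump offsets, which hints at why larger filters are hard to get right.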

Tycho Andersen, the author of the "seccomp trap to user space" patch set, sees a number of situations where the current mechanism falls short. His scenarios include allowing a container to load modules, create device nodes, or mount filesystems — with rigid controls applied. For example, creation of a /dev/null device would be allowed, but new block devices (or almost anything else) would not. Policies to allow this kind of action can be complex and site-specific; they are not something that would be easily implemented in a BPF program. But it might be possible to write something in user space that could handle decisions like these.

To enable this, Andersen's patch set adds a new return type for BPF programs (SECCOMP_RET_USER_NOTIF) that will cause the program making the call to be blocked while information about the call is sent to user space. A controlling program wanting to receive these notifications (and make decisions) must open a file descriptor by setting the SECCOMP_FILTER_FLAG_GET_LISTENER flag when loading the filter program. The returned file descriptor can then be polled for events; reading from it will return the next available notification signaled by the BPF filter.
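The polling side of this can be sketched with nothing but standard POSIX calls. The example below is an illustration under assumptions rather than code from the patch set: read_next_event() is a hypothetical name, and since SECCOMP_FILTER_FLAG_GET_LISTENER exists only in the proposed patches, the function simply takes a descriptor that has already been obtained and treats it as any other pollable file descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for the listener descriptor to
 * become readable, then read one event into buf.
 * Returns the number of bytes read, 0 on timeout, or -1 on error.
 */
ssize_t read_next_event(int listener_fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = listener_fd, .events = POLLIN };
    int rc;

    assert(buf != NULL);

    /* Retry if a signal interrupts the wait. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;               /* nothing arrived before the timeout */
    if (!(pfd.revents & POLLIN))
        return -1;              /* error or hangup without readable data */

    return read(listener_fd, buf, len);
}
```

In a real control program the descriptor would more likely be added to an existing event loop alongside the program's other descriptors.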

Notifications, when read, are encoded in this structure:

    struct seccomp_notif {
        __u64 id;
        pid_t pid;
        struct seccomp_data data;
    };

The returned id is a unique number identifying this event, pid is the ID of the process that triggered the notification, and data is the seccomp_data structure that was given to the BPF program describing the system call in progress:

    struct seccomp_data {
        int   nr;                   /* System call number */
        __u32 arch;                 /* AUDIT_ARCH_* value
                                       (see <linux/audit.h>) */
        __u64 instruction_pointer;  /* CPU instruction pointer */
        __u64 args[6];              /* Up to 6 system call arguments */
    };

The user-space program can then meditate on whatever it is that the controlled program wishes to do. Note that the behavior of user notifications is similar to SECCOMP_RET_ERRNO, in that the system call itself will not be invoked in the context of the controlled process. So if the controlling process wants the system call to run in some form, it must do the work in its own context. When it has reached a decision (and done any needed work), it communicates that back to the kernel by filling in a seccomp_notif_resp structure and writing it back to the notification file descriptor:

    struct seccomp_notif_resp {
        __u64 id;
        __s32 error;
        __s64 val;
    };

The id value must match that found in the original notification. error should be either zero or a negative error code; in the latter case, it will be negated and used as an error return from the system call that created the notification in the first place. If error is zero, then that system call will return successfully with val as its return value.
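Put together, one round of this exchange might look like the sketch below. It is an illustration under assumptions, not the patch set's code: the structures are mirrored locally with <stdint.h> types, handle_one_notification(), policy_fn, and deny_all() are hypothetical names, and the descriptor is assumed to accept a plain read() followed by a write(), as the text describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Local mirrors of the structures shown in the article. */
struct seccomp_data {
    int      nr;
    uint32_t arch;
    uint64_t instruction_pointer;
    uint64_t args[6];
};

struct seccomp_notif {
    uint64_t id;
    pid_t    pid;
    struct seccomp_data data;
};

struct seccomp_notif_resp {
    uint64_t id;
    int32_t  error;
    int64_t  val;
};

/*
 * Hypothetical policy callback: returns 0 and stores the emulated return
 * value in *val, or returns a negative error code to fail the call.
 */
typedef int (*policy_fn)(const struct seccomp_notif *req, int64_t *val);

/* Read one notification, ask the policy, and write the response back. */
int handle_one_notification(int listener_fd, policy_fn policy)
{
    struct seccomp_notif req;
    struct seccomp_notif_resp resp;
    int64_t val = 0;
    int err;

    assert(policy != NULL);

    if (read(listener_fd, &req, sizeof(req)) != (ssize_t)sizeof(req)) {
        fprintf(stderr, "short or failed read of notification\n");
        return -1;
    }

    err = policy(&req, &val);

    memset(&resp, 0, sizeof(resp));
    resp.id = req.id;           /* must match the original notification */
    if (err < 0) {
        resp.error = err;       /* negative error code; val is unused */
    } else {
        resp.error = 0;
        resp.val = val;         /* becomes the system call's return value */
    }

    if (write(listener_fd, &resp, sizeof(resp)) != (ssize_t)sizeof(resp)) {
        fprintf(stderr, "short or failed write of response\n");
        return -1;
    }
    return 0;
}

/* Example policy: refuse every request with EPERM. */
int deny_all(const struct seccomp_notif *req, int64_t *val)
{
    (void)req;
    (void)val;
    return -EPERM;
}
```

Keeping the decision in a callback separates the protocol handling from the site-specific policy, which is the part that was too complex to express in BPF in the first place.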

As a somewhat experimental addition, the final patch in the series adds two fields to the seccomp_notif_resp structure:

        __u8 return_fd;
        __u32 fd;

These fields allow the control program to provide a file descriptor to be used as the return value from the system call; if return_fd is nonzero, fd will be passed to the controlled program. As Andersen notes, this mechanism will only work for system calls that are expected to return a file descriptor in the first place, but it's a starting point.

The protocol for communication between the kernel and the control program has already been the topic of some discussion; in its current form, it will be difficult to extend when new features are (inevitably) added. Reviewers have suggested using the netlink protocol instead, but that would involve more complexity than the current implementation. Whether those reviewers will insist on that change before this code can be merged remains to be seen.

Overall, this patch series is another step in an interesting set of changes that has been taking place. The boundary between the kernel and user space was once a hard and well-defined line described by the system-call interface. Increasingly, developers are working to make it possible for users to move functionality across that line in both directions, either putting policy into the kernel with BPF programs or moving it out with various types of user-space helpers. As the computing environment changes, it seems that this flexibility will be needed to ensure that Linux stays relevant.

Deferring decisions to userspace?

Posted Jun 2, 2018 12:54 UTC (Sat) by TheJH (subscriber, #101155) (1 response)

The article is titled "Deferring seccomp decisions to user space". As far as I can tell, the referenced patchset doesn't actually defer the whole decision; it allows userspace to synchronously handle the syscall and provide a return value, but userspace can't decide to just let the syscall through, it can only emulate it.

Deferring decisions to userspace?

Posted Jun 2, 2018 15:22 UTC (Sat) by corbet (editor, #1)

That is a good point, something I didn't mention properly in the article. It behaves a lot like SECCOMP_RET_ERRNO. I have added a little text to try to fill that in, thanks.

Deferring seccomp decisions to user space

Posted Jun 2, 2018 13:04 UTC (Sat) by brauner (subscriber, #109349) (3 responses)

This is a much-needed patchset, and I'm really happy that, since the first design discussions at Plumbers last year, it has seen rapid development thanks to Tycho. No one has really done a lot of bikeshedding on it, which is great!

It seems that people haven't really noticed how many use cases this will enable once it is merged. If I were one of the gVisor guys, I'd take a very close look at this patchset and whether it would make it possible to kick out ptrace.

It's excellent that we've managed to decouple this from the eBPF seccomp patchset. The last step is to hopefully not tie this to netlink, as that looks like a lot of protocol for not much gain in this case. But we'll see.

Deferring seccomp decisions to user space

Posted Jun 3, 2018 13:11 UTC (Sun) by jhoblitt (subscriber, #77733) (2 responses)

The "gain" of using netlink is that a standard client lib, such as libnl, could be used instead of every service having a custom interface with semantics that evolve differently from other kernel interfaces over time. Imagine what the state of interoperability would be if most "ReSTful" web APIs used a custom serialization format instead of JSON.

Deferring seccomp decisions to user space

Posted Sep 14, 2018 13:32 UTC (Fri) by mathstuf (subscriber, #69389) (1 response)

Well, that'd work if Go projects weren't so intent on not using *any* non-Go code in their stacks…</snark>

To not make this just a snark, I'll add a data point. I've seen git-lfs not want to fork out to Git for things like `remote get-url` and rather re-implement `insteadOf` and `pushInsteadOf` yet again. And so git-lfs is still broken with alias remote URLs that differ in push and pull. Attribute reading is also broken in the case of user "[attr]" attributes. Yes, both have issues filed (and I don't know Go (yet?) well enough to fix it myself).

I believe the *only* thing they fork for is to find out the version of Git used elsewhere. There might be one or two more instances as well, but they're of a similar level of actual functionality sharing.

Deferring seccomp decisions to user space

Posted Sep 14, 2018 14:46 UTC (Fri) by zlynx (guest, #2285)

I implemented a Go netlink reader for connection tracking. It wasn't hard for the most part (I do wish someone had explicitly written a few comments about data alignment instead of making it implicitly hidden in macros if I remember correctly).

I do wish that the netlink formats were better documented.

Calling C code from Go causes all sorts of complex interactions with the green threads and garbage collection so it is not a good idea to casually link into CGo.

Deferring seccomp decisions to user space

Posted Jun 2, 2018 16:42 UTC (Sat) by skx (subscriber, #14652) (1 response)

I have to say I'm interested in seeing how this turns out - at least partially because I wrote a linux-security-module which defers checks for exec calls to user-space. The code is reasonably clean, and the overhead of having to exec a user-space binary is essentially unnoticed.

The code is here:

https://github.com/skx/linux-security-modules/tree/master/security/can-exec

BPF has so many uses, and I'm loving the way it is becoming better documented, and more useful. I'm sure it is only a matter of time before it is invoked by linux-security modules.

Deferring seccomp decisions to user space

Posted Jun 3, 2018 22:47 UTC (Sun) by oscode (guest, #82250)

Thanks for sharing! Your LSM projects look interesting; it's just a shame they can't be dynamically loaded.

Deferring seccomp decisions to user space

Posted Jun 2, 2018 17:27 UTC (Sat) by rvolgers (guest, #63218) (1 response)

This seems really nice for the seccomp usecase, but it does kind of put the spotlight on how awkward ptrace is in comparison.

I really wish we'd one day get a clean file descriptor based debugging API instead of the ptrace pseudo-reparenting and signal abuse nonsense.

Deferring seccomp decisions to user space

Posted Jun 5, 2018 0:07 UTC (Tue) by SEJeff (guest, #51588)

We almost had this (a better ptrace, no userspace API on top of it) with utrace, but Andrew Morton (ultimately) shot it down and Linus didn't like it. This caused Roland McGrath to stop working on utrace / uprobes almost entirely.

Some light reading:

https://lwn.net/Articles/371210/
https://yarchive.net/comp/linux/utrace.html

Deferring seccomp decisions to user space

Posted Jun 2, 2018 19:14 UTC (Sat) by smurf (subscriber, #17840) (1 response)

Wouldn't handling of these calls be a whole lot easier if there was a way to tell the monitored program to proceed with the syscall in question? I'd assume that calls like open() or exec() on behalf of the tracee are a major PITA to do correctly – in other words: a security hole in waiting.

Deferring seccomp decisions to user space

Posted Jun 2, 2018 20:07 UTC (Sat) by TheJH (subscriber, #101155)

But doing that reasonably safely (without race conditions) is a big PITA, especially if the sandboxed process is multithreaded. If you look at the path argument of an open() call and use that to determine whether the call should be allowed, it's probably safest to do the actual open() in the supervisor process and then install the resulting FD in the sandboxed process.

File paths?

Posted Jun 3, 2018 0:47 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) (8 responses)

How does it access the file paths? I guess the filter can just ptrace the requesting process, but that's already piling up overhead on top of the overhead.

Perhaps a special case for strings could be added?

File paths?

Posted Jun 3, 2018 1:49 UTC (Sun) by dezgeg (subscriber, #92243) (3 responses)

Presumably with process_vm_readv(). That is still a user/kernel context switch per string, but much, much better than ptrace()...

File paths?

Posted Jun 3, 2018 3:11 UTC (Sun) by roc (subscriber, #30627)

rr uses /proc/<pid>/mem to read tracee memory instead of PTRACE_PEEKDATA, even though it's already ptracing, because the former is so much faster. I assume gdb does too.

File paths?

Posted Jun 3, 2018 5:16 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) (1 response)

Still, a special case for automatic transmission of string arguments might make sense. open/stat calls are probably 90% of security-related calls and special-casing their arguments might give a sizable performance benefit.

File paths?

Posted Jun 3, 2018 13:21 UTC (Sun) by jhoblitt (subscriber, #77733)

I wonder if anyone has gathered statistics on syscall distribution for various types of workload?

I suspect that `statfs()` and `access()` are also frequent syscalls with string params. File distribution programs, such as HTTP servers, can produce a fairly extreme number of `access()` calls.

File paths?

Posted Jun 4, 2018 21:37 UTC (Mon) by wahern (subscriber, #37304) (3 responses)

Isn't that susceptible to a race condition? systrace (https://en.wikipedia.org/wiki/Systrace) never saw widespread adoption exactly because of the race condition, both on Linux and on OpenBSD (with an in-kernel implementation). The TOCTTOU race is that a signal handler or thread changes the path between the check and the actual open.

The solution is to copy the path or otherwise make it immutable. That's costly and it's why the seccomp BPF filter originally didn't support processing the file path string. Has that changed?

File paths?

Posted Jun 4, 2018 21:41 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) (2 responses)

No, there's no race condition. The kernel code would have to copy strings into the message sent to the userspace helper.

The helper code then can do all the required open/access/stat stuff and return the results as a file descriptor (open) or a static block of data (stat/access).

Obviously, copying the parameters will add some overhead, but it should be way less than doing additional ptrace/read_mem calls from the userspace helper.

File paths?

Posted Jun 5, 2018 19:26 UTC (Tue) by wahern (subscriber, #37304) (1 response)

Is that how it works _now_? Is any of that work already in place?

File paths?

Posted Jun 5, 2018 21:34 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)

Right now BPF syscall filter programs can't access the string arguments at all, so there's no problem.

Deferring seccomp decisions to user space

Posted Jun 3, 2018 22:57 UTC (Sun) by josh (subscriber, #17465)

I find myself curious if this could be used to emulate non-existent system calls, or even invent an entirely new syscall interface with arbitrary syscall numbers. The userspace program receives the syscall number and arguments; it could do anything with those.