process_madvise(), pidfd capabilities, and the revenge of the PIDs

By Jonathan Corbet
January 21, 2020

Once upon a time, there were few ways for one process to operate upon another after its creation; sending signals and ptrace() were about it. In recent years, interest in providing ways for processes to control others has been on the increase, and the kernel's process-management API has been expanded accordingly. Along these lines, the process_madvise() system call has been proposed as a way for one process to influence how memory management is done in another. There is a new process_madvise() series which is interesting in its own right, but this series has also raised a couple of questions about how process management should be improved in general.

The existing madvise() system call allows a process to make suggestions to the kernel about how its address space should be managed. The 5.4 kernel saw a couple of new types of advice that could be provided with madvise(): MADV_COLD and MADV_PAGEOUT. The former requests that the kernel place the indicated range of pages onto the inactive list, essentially saying that they have not been used in a long time. Those pages will thus be among the first considered for reclaim if the kernel needs memory for other purposes. MADV_PAGEOUT, instead, is a stronger statement that the indicated pages are no longer needed; it will cause them to be reclaimed immediately.

These new requests are useful for processes that know what their future access patterns will be. But it seems that in certain environments — Android, in particular — processes lack that knowledge, but the management system does know when certain memory ranges are no longer needed. The bulk of a process's address space could be marked as MADV_COLD when that process is moved out of the foreground, for example. In such settings, letting one process call madvise() on behalf of another helps the system as a whole make the best use of its memory resources. That is the purpose behind the process_madvise() proposal.

When Minchan Kim posted a new process_madvise() series in early January, the API looked like this:

    int process_madvise(int pidfd, void *addr, size_t length, int advice,
    			unsigned long flags);

The effect of this call would be the same as if the process identified by pidfd (which, as one might guess from the name, is a pidfd rather than a process ID) had called madvise() with the given addr, length, and advice arguments; the flags argument is not currently used. Only a subset of madvise() comments (MADV_COLD, MADV_PAGEOUT, MADV_MERGEABLE, and MADV_UNMERGEABLE) is supported; others can be added as the use cases for them emerge.

It seemed like most of the discussion on process_madvise() had run its course by now, and that there were few obstacles to its path into the mainline kernel. But, then, a couple of issues came up.

Pidfd capabilities

One of those doesn't directly affect process_madvise(), but does offer an interesting look into where the pidfd API is headed in general. In the current patch sets, the question of whether one process is allowed to call process_madvise() on another is answered with the usual "would ptrace() be allowed?" test. That comes down to either running under the same user ID or having the CAP_SYS_PTRACE capability. This test is standard practice and uncontroversial, but Christian Brauner, the developer behind most of the pidfd work, has another idea in mind:

When you create a pidfd, e.g. with clone3() and you'd wanted to use it for madvise you'd need to set a flag like pidfd_cap_madvise or pidfd_feature_madvise when you create the pidfd. Only if the pidfd was created with that flag set could you use it with madvise.

In essence, under this scheme, a pidfd would include a capability mask of its own stating what could be done with it. It is analogous to how ordinary file descriptors work: one can only write to a file descriptor if it was created with a flag that allows writing. Brauner requested that process_madvise() not be merged until 5.7 so that it could use this (not yet implemented) capability mechanism from the beginning.

There is an interesting question about how pidfd capabilities would work, brought up by Daniel Colascione: would having a pidfd with a given capability flag be sufficient to carry out a specific action, or would the traditional tests apply too? In other words, if a process holds a pidfd that was created with the ability to pass it to process_madvise(), would that process still need to pass the ptrace() test for the action to succeed?

Brauner admitted to "going back and forth" on that question; he said, though, that his inclination was to still require the privilege to execute the operation independently from the capability flag on the pidfd. So those capabilities (or the lack thereof) would serve only to restrict operations that would otherwise be allowed. Colascione argued in favor of pidfd capabilities actually granting access, though, and Brauner agreed to look into it. Patches, he said, would be posted soon.

PIDs: not dead yet

One other characteristic of the proposed process_madvise() API is that the caller must possess a pidfd to use it — unlike other system calls like kill() or setpriority(), which take process IDs. That is not uncommon for new, process-oriented system calls; pidfds are the cool new kid on the block and they appear to be generally preferred. A pidfd unambiguously identifies a process; a process ID, instead, can vary from one namespace to the next and, should the target process exit, might be reused for an entirely unrelated process. For such reasons, pidfds have quickly become the favored way of identifying processes in new system calls; as Colascione put it: "All new APIs should use pidfds: they're better than numeric PIDs in every way".

It turns out, though, that not everybody agrees with that point of view. Process IDs are often already known, while a pidfd would have to be created; PIDs can also be specified by users on a command line, while pidfds cannot. Kirill Tkhai was one of the dissenters:

Ordinary pid interfaces also should be available. There are a lot of cases, when they are more comfortable. Say, a calling of process_madvise() from tracer, when a tracee is stopped. In this moment the tracer knows everything about tracee state, and pidfd brackets pidfd_open() and close() around actual action look just stupid, and this is cpu time wasting.

Thus, he argued, every new process API should be able to handle both PIDs and pidfds. Kim took this request to heart, as can be seen in a new version of the process_madvise() patch set posted one week later. The API for this proposed system call now is:

    int process_madvise(int which, pid_t pid, void *addr, size_t length,
    			int advice, unsigned long flag);

The new which parameter would be either P_PID or P_PIDFD to tell the kernel how the pid argument should be interpreted.

This change highlights a question that needs to be answered by the wider community. If there is a consensus that all new process-related system calls should support both ways of identifying processes, then this convention should really be applied universally and consistently to all new interfaces. Otherwise it perhaps should not be used even for process_madvise(). Creating a mix of APIs, some of which accept only one way while others support both, seems like the worst outcome. The Linux system-call API is inconsistent enough as it is now; future developers will not be grateful if new system calls make that situation worse.

Index entries for this article
Kernel	Memory management
Kernel	pidfd
Kernel	System calls

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 12:29 UTC (Tue) by dskoll (subscriber, #1630) [Link] (5 responses)

Why not take this to its logical extreme?

execute_as_process(int pidfd, void (*func)());

I am of course not completely serious and realize the difficulty of implementing this as well as the security implications, but it seems we are approaching this asymptotically.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 13:26 UTC (Tue) by rvolgers (guest, #63218) [Link] (1 responses)

Or, as windows calls it, CreateRemoteThreadEx.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 14:46 UTC (Tue) by dskoll (subscriber, #1630) [Link]

Hah! I have never programmed on Windows and know nothing about its API, and feel a little ashamed for having rediscovered that. :)

Thanks for the info.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 21:17 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (2 responses)

Well... you *could* have something like this:

long process_syscall(int pidfd, long number, ...)

It would behave as-if process had invoked syscall(2) with the remaining arguments. Maybe you also whitelist number to syscalls that actually make sense to invoke remotely, and are unlikely to cause massive reentrancy or threading issues.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 22, 2020 2:42 UTC (Wed) by roc (subscriber, #30627) [Link] (1 responses)

There are many states that thread could be in that would be problematic. Most of the time the thread would be blocked in the kernel for some reason, in which case you'd want to queue the syscall for execution when it would next return to userspace, but that might never happen.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 23, 2020 19:03 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

It was mostly a tongue-in-cheek suggestion, and I'm pretty sure actually doing this would be a Bad Idea. But if you really wanted to do it, you could probably use the signals API to deal with those issues. That is:

Create a new SIGFOO value, and make all of the sigaction() et al. functions accept it.
Whenever someone calls process_syscall(), the target process receives a SIGFOO.
The default behavior of the SIGFOO is to call syscall(...) as if from a signal handler.
If the signal is masked, handled, or ignored, it behaves as you would expect (syscall() is not invoked).
(Optional) Masking, handling, or ignoring the signal requires a capability and/or UID=0, so that containers cannot veto the actions of their supervisors. Alternatively, it can't be masked, handled, or ignored at all, so that it behaves like SIGSTOP and SIGKILL.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 14:28 UTC (Tue) by Paf (subscriber, #91811) [Link] (4 responses)

I am very curious to see how this gets worked out, my instinct is that the desire for the switchable interface implemented in the calls is deeply silly. Unless there are significant performance or other (security?) implications, this desire for an alternate interface should be solved with wrappers and/or macros, not have a switching argument baked in to the basic call.

That is to say, there’s nothing silly about wanting both interfaces, but having a “what is this other argument” switch argument in the call, when not strictly necessary, seems... yuck? Perhaps opinions differ :)

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 15:45 UTC (Tue) by cyphar (subscriber, #110703) [Link] (3 responses)

The counter-argument is waitid(2), which has basically the exact same interface.

Additionally, doing the switching in user-space isn't all that fun. The syscall has to take pidfds, otherwise there's no point to permitting pidfds to be used with the interface (getting the pid of a pidfd is painful, but it also immediately becomes susceptible to pid recycling attacks -- the thing pidfds were meant to block). And that would annoy the people who are unhappy with requiring pidfds for new syscalls (they don't want to take up file handles, and it's likely that for their programs creating the pidfd is a meaningless gesture because they have no way of actually being sure the pid is correct).

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 17:35 UTC (Tue) by josh (subscriber, #17465) [Link] (1 responses)

waitid has a flag to avoid having to create a new version of the syscall. It already accepted flags, so adding a flag to accept a pidfd in place of the existing pid argument made some sense.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 22, 2020 6:46 UTC (Wed) by cyphar (subscriber, #110703) [Link]

Sure, waitid(2) obviously wasn't written with pidfds in mind and adding them later was to avoid making a new syscall -- my point was that the resulting interface is identical to the one being proposed (even the proposed constants -- P_PID and P_PIDFD -- are the same):

int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);

int process_madvise(int which, pid_t pid, void *addr, size_t length, int advice, unsigned long flag);

Where @which is equivalent to @idtype. Thus, it is arguably only as ugly as the waitid(2) interface. Also, they didn't re-use a flag argument -- waitid(2) explicitly had "type switching" from the outset (though it was intended to be used to differentiate between process groups and PIDs).

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 22:49 UTC (Tue) by Paf (subscriber, #91811) [Link]

That’s an interesting point about the PID uncertainty for those programs, thanks. Though at the same time, the pidfd isn’t any *worse* than the questionable pid... hmm.

This might be a situation where the kernel commitment to backwards compatibility implicitly pushes a less elegant interface. (I say implicitly because there is no explicit backwards compatibility issue here, but I think the philosophy arguably applies because not having both PIDs and pidfds implicitly pushes people to the new interface.)

Hmm.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 15:50 UTC (Tue) by cyphar (subscriber, #110703) [Link] (5 responses)

> In this moment the tracer knows everything about tracee state, and pidfd brackets pidfd_open() and close() around actual action look just stupid, and this is cpu time wasting.

While I do understand wanting to maintain support for PIDs in newer syscalls (after all, in some cases you only get a PID from a user or other program), I don't think that a tracer program would be written in the way described. It's far more likely that the tracer would already have a pidfd open for each process it is tracing. But then again, since ptrace (and ptrace-related syscalls) doesn't use pidfds, it would also be fair to say that the interface mismatch would make the code ugly no matter what.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 19:32 UTC (Tue) by rvolgers (guest, #63218) [Link] (4 responses)

If ever there was a good use case for pidfd's it's ptrace. That is one ugly interface, with all the signal magic and pseudo-reparenting.

Of course, porting it would take quite a lot of effort probably, nevermind deprecating the old interface so all the cruft can be removed, but one can dream.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 19:52 UTC (Tue) by roc (subscriber, #30627) [Link]

Maybe hack pidfds into ptrace and other syscalls by passing "-pidfd" as the pid?

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 19:58 UTC (Tue) by roc (subscriber, #30627) [Link] (2 responses)

I think pidfds for ptrace() would be good, but I don't think they immediately solve the major issues with ptrace.

I would really like the ability to hand-off ptrace control to other processes by passing them a pidfd, but that would require lots more work. Something like:
* Make sure pidfd_wait or whatever can read the special ptrace status events.
* When pidfds are used, break the "ptrace parent" relationship in the kernel so *any* process with a pidfd for the tracee can ptrace() it or get the ptrace status events. (I bet this is *really* hard.)
But it would make much-requested rr features, like the ability to start debugging an in-progress rr recording without interrupting it, much more tractable.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 21, 2020 22:52 UTC (Tue) by Paf (subscriber, #91811) [Link] (1 responses)

Why do you think breaking the ptrace parent relationship would be so hard?

I’m not really familiar with that part of ptrace, but I have had to look at the signal handling dance and it is a *mess*. (Not incorrect in any way, just... messy)

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 22, 2020 2:44 UTC (Wed) by roc (subscriber, #30627) [Link]

The current ptrace code is designed around that relationship, and I assume that any significant changes to ptrace code are going to be hard. I'd love to be wrong!

“which”?

Posted Jan 21, 2020 21:12 UTC (Tue) by mirabilos (subscriber, #84359) [Link] (4 responses)

This breaks types, though.

“which” is an int, sure, but the PID is of type pid_t, while pidfds as file descriptors are of type int.

I’d rather have it…

int process_madvise(int pidfd, pid_t pid, …

… and use “pid” iff pidfd == -1 (which is the usual closed/invalid fd number).

“which”?

Posted Jan 22, 2020 12:37 UTC (Wed) by NAR (subscriber, #1313) [Link] (3 responses)

That could be confusing - the pidfd variable would be used not only to store a pidfd, but to select which interface is called. Also, this interface would invite bugs where both the pidfd and pid would be set, maybe even to different processes. Documentation could make it explicit, however, programmers have the habit to not read the documentation and write code based on assumption created by reading the API.

My idea was to use the flags argument to select between pid and pidfd - but I guess that flag should be the same as accepted by madvise. If only C had function overloading... But it doesn't so maybe bite the bullet and create two functions: madvise_by_pid and madvise_by_pidfd.

“which”?

Posted Jan 22, 2020 15:20 UTC (Wed) by mirabilos (subscriber, #84359) [Link] (2 responses)

> the pidfd variable

No, there’s no extra variable, this is just a parameter.

You’d either call it with…

process_madvise(pidfd, /* ignored */ 0, …)

… or with…

process_madvise(-1, pid, …)

… so this question never comes up.

> bugs where both the pidfd and pid would be set

This is a very common interface, and the answer is trivial, as I stated in the comment above: pid is used iff pidfd == -1 (meaning if it’s not -1, pid will be ignored). This is a basic standard technique.

“which”?

Posted Jan 22, 2020 16:59 UTC (Wed) by james (subscriber, #1325) [Link] (1 responses)

Perhaps the call should return an error if the pid is non-zero and pidfd isn't -1.

“which”?

Posted Jan 22, 2020 17:19 UTC (Wed) by mirabilos (subscriber, #84359) [Link]

No, why?

It’s really common practice to do things this way (maybe you don’t know this as you seem to be a C++ programmer, but in the C/UNIX world, we do).

Take, for example, mmap, when called with MAP_ANONYMOUS in flags, ignores the fd argument instead of checking it.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 22, 2020 1:25 UTC (Wed) by KaiRo (subscriber, #1987) [Link] (2 responses)

I have no clue about kernel-level coding, but why not have a function like pidfd_from_pid(pid) that would create/take a pidfd for a given pid (with all the ambivalence a pid has) and which you then hand over to new calls that only support a pidfd? From the command line that should be fine and other code should see to convert to pidfd in the longer run anyhow.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Jan 22, 2020 6:41 UTC (Wed) by cyphar (subscriber, #110703) [Link] (1 responses)

That already exists with the pidfd_open(2) syscall, the concern is that it's needless overhead (and a meaningless gesture) for userspace to have to create a pidfd if they are just going to use it as though it were a PID (with all of the issues related to it). That's what Kirill Tkhai was referring to when they wrote:

> In this moment the tracer knows everything about tracee state, and pidfd brackets pidfd_open() and close() around actual action look just stupid, and this is cpu time wasting.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Posted Feb 3, 2020 20:56 UTC (Mon) by nix (subscriber, #2304) [Link]

This is a very silly argument, though. You can easily wrap the pidfd_open/pidfd-operation/_close in a function, and as for CPU time wastage -- two transitions to kernel space will likely be utterly minor compared to the large number of transitions that almost all pidfd operations are likely to incur; and if they only incur one, then the tripling of CPU time is *still* utterly irrelevant unless they're being called on literally millions of processes -- in which case one must wonder what on earth they are doing, and whether they should be using some other mechanism based around pgids or cgroups or something so they don't have to do something so ridiculously inefficient as calling *any* one syscall millions of times on millions of foreign processes.