process_madvise(), pidfd capabilities, and the revenge of the PIDs
The existing madvise() system call allows a process to make suggestions to the kernel about how its address space should be managed. The 5.4 kernel saw a couple of new types of advice that could be provided with madvise(): MADV_COLD and MADV_PAGEOUT. The former requests that the kernel place the indicated range of pages onto the inactive list, essentially saying that they have not been used in a long time. Those pages will thus be among the first considered for reclaim if the kernel needs memory for other purposes. MADV_PAGEOUT, instead, is a stronger statement that the indicated pages are no longer needed; it will cause them to be reclaimed immediately.
These new requests are useful for processes that know what their future access patterns will be. But it seems that in certain environments — Android, in particular — processes lack that knowledge, but the management system does know when certain memory ranges are no longer needed. The bulk of a process's address space could be marked as MADV_COLD when that process is moved out of the foreground, for example. In such settings, letting one process call madvise() on behalf of another helps the system as a whole make the best use of its memory resources. That is the purpose behind the process_madvise() proposal.
When Minchan Kim posted a new process_madvise() series in early January, the API looked like this:
int process_madvise(int pidfd, void *addr, size_t length, int advice, unsigned long flags);
The effect of this call would be the same as if the process identified by pidfd (which, as one might guess from the name, is a pidfd rather than a process ID) had called madvise() with the given addr, length, and advice arguments; the flags argument is not currently used. Only a subset of madvise() comments (MADV_COLD, MADV_PAGEOUT, MADV_MERGEABLE, and MADV_UNMERGEABLE) is supported; others can be added as the use cases for them emerge.
It seemed like most of the discussion on process_madvise() had run its course by now, and that there were few obstacles to its path into the mainline kernel. But, then, a couple of issues came up.
Pidfd capabilities
One of those doesn't directly affect process_madvise(), but does offer an interesting look into where the pidfd API is headed in general. In the current patch sets, the question of whether one process is allowed to call process_madvise() on another is answered with the usual "would ptrace() be allowed?" test. That comes down to either running under the same user ID or having the CAP_SYS_PTRACE capability. This test is standard practice and uncontroversial, but Christian Brauner, the developer behind most of the pidfd work, has another idea in mind:
In essence, under this scheme, a pidfd would include a capability mask of its own stating what could be done with it. It is analogous to how ordinary file descriptors work: one can only write to a file descriptor if it was created with a flag that allows writing. Brauner requested that process_madvise() not be merged until 5.7 so that it could use this (not yet implemented) capability mechanism from the beginning.
There is an interesting question about how pidfd capabilities would work, brought up by Daniel Colascione: would having a pidfd with a given capability flag be sufficient to carry out a specific action, or would the traditional tests apply too? In other words, if a process holds a pidfd that was created with the ability to pass it to process_madvise(), would that process still need to pass the ptrace() test for the action to succeed?
Brauner admitted
to "going back and forth
" on that question; he said, though,
that his inclination was to still require the privilege to execute the
operation independently from the capability flag on the pidfd. So those
capabilities (or the lack thereof) would serve only to restrict operations
that would otherwise be allowed. Colascione argued
in favor of pidfd capabilities actually granting access, though, and
Brauner agreed
to look into it. Patches, he said, would be posted soon.
PIDs: not dead yet
One other characteristic of the proposed process_madvise() API is
that the caller must possess a pidfd to use it — unlike other system calls
like kill()
or setpriority(),
which take process IDs. That is not uncommon for new, process-oriented
system calls; pidfds are the cool new kid on the block and they appear to
be generally preferred. A pidfd unambiguously identifies a process; a
process ID, instead, can vary from one namespace to the next and, should
the target process exit, might be reused for an entirely unrelated process.
For such reasons, pidfds have quickly become the favored way of identifying
processes in new system calls; as Colascione put
it: "All new APIs should use pidfds: they're better than numeric
PIDs in every way
".
It turns out, though, that not everybody agrees with that point of view. Process IDs are often already known, while a pidfd would have to be created; PIDs can also be specified by users on a command line, while pidfds cannot. Kirill Tkhai was one of the dissenters:
Thus, he argued, every new process API should be able to handle both PIDs and pidfds. Kim took this request to heart, as can be seen in a new version of the process_madvise() patch set posted one week later. The API for this proposed system call now is:
int process_madvise(int which, pid_t pid, void *addr, size_t length, int advice, unsigned long flag);
The new which parameter would be either P_PID or P_PIDFD to tell the kernel how the pid argument should be interpreted.
This change highlights a question that needs to be answered by the wider
community. If there is a consensus that all new process-related system
calls should support both ways of identifying processes, then this
convention should really be applied universally and consistently to all new
interfaces. Otherwise it perhaps
should not be used even for process_madvise(). Creating a mix of
APIs, some of which accept only one way while others support both,
seems like the worst outcome. The Linux system-call API is inconsistent
enough as it is now; future developers will not be grateful if new system
calls make that situation worse.
Index entries for this article | |
---|---|
Kernel | Memory management |
Kernel | pidfd |
Kernel | System calls |
Posted Jan 21, 2020 12:29 UTC (Tue)
by dskoll (subscriber, #1630)
[Link] (5 responses)
Why not take this to its logical extreme?
execute_as_process(int pidfd, void (*func)());
I am of course not completely serious and realize the difficulty of implementing this as well as the security implications, but it seems we are approaching this asymptotically.
Posted Jan 21, 2020 13:26 UTC (Tue)
by rvolgers (guest, #63218)
[Link] (1 responses)
Posted Jan 21, 2020 14:46 UTC (Tue)
by dskoll (subscriber, #1630)
[Link]
Hah! I have never programmed on Windows and know nothing about its API, and feel a little ashamed for having rediscovered that. :)
Thanks for the info.
Posted Jan 21, 2020 21:17 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Well... you *could* have something like this:
It would behave as-if process had invoked syscall(2) with the remaining arguments. Maybe you also whitelist number to syscalls that actually make sense to invoke remotely, and are unlikely to cause massive reentrancy or threading issues.
Posted Jan 22, 2020 2:42 UTC (Wed)
by roc (subscriber, #30627)
[Link] (1 responses)
Posted Jan 23, 2020 19:03 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link]
It was mostly a tongue-in-cheek suggestion, and I'm pretty sure actually doing this would be a Bad Idea. But if you really wanted to do it, you could probably use the signals API to deal with those issues. That is:
Posted Jan 21, 2020 14:28 UTC (Tue)
by Paf (subscriber, #91811)
[Link] (4 responses)
That is to say, there’s nothing silly about wanting both interfaces, but having a “what is this other argument” switch argument in the call, when not strictly necessary, seems... yuck? Perhaps opinions differ :)
Posted Jan 21, 2020 15:45 UTC (Tue)
by cyphar (subscriber, #110703)
[Link] (3 responses)
Additionally, doing the switching in user-space isn't all that fun. The syscall has to take pidfds, otherwise there's no point to permitting pidfds to be used with the interface (getting the pid of a pidfd is painful, but it also immediately becomes susceptible to pid recycling attacks -- the thing pidfds were meant to block). And that would annoy the people who are unhappy with requiring pidfds for new syscalls (they don't want to take up file handles, and it's likely that for their programs creating the pidfd is a meaningless gesture because they have no way of actually being sure the pid is correct).
Posted Jan 21, 2020 17:35 UTC (Tue)
by josh (subscriber, #17465)
[Link] (1 responses)
Posted Jan 22, 2020 6:46 UTC (Wed)
by cyphar (subscriber, #110703)
[Link]
int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
vs
int process_madvise(int which, pid_t pid, void *addr, size_t length, int advice, unsigned long flag);
Where @which is equivalent to @idtype. Thus, it is arguably only as ugly as the waitid(2) interface. Also, they didn't re-use a flag argument -- waitid(2) explicitly had "type switching" from the outset (though it was intended to be used to differentiate between process groups and PIDs).
Posted Jan 21, 2020 22:49 UTC (Tue)
by Paf (subscriber, #91811)
[Link]
This might be a situation where the kernel commitment to backwards compatibility implicitly pushes a less elegant interface. (I say implicitly because there is no explicit backwards compatibility issue here, but I think the philosophy arguably applies because not having both PIDs and pidfds implicitly pushes people to the new interface.)
Hmm.
Posted Jan 21, 2020 15:50 UTC (Tue)
by cyphar (subscriber, #110703)
[Link] (5 responses)
While I do understand wanting to maintain support for PIDs in newer syscalls (after all, in some cases you only get a PID from a user or other program), I don't think that a tracer program would be written in the way described. It's far more likely that the tracer would already have a pidfd open for each process it is tracing. But then again, since ptrace (and ptrace-related syscalls) doesn't use pidfds, it would also be fair to say that the interface mismatch would make the code ugly no matter what.
Posted Jan 21, 2020 19:32 UTC (Tue)
by rvolgers (guest, #63218)
[Link] (4 responses)
Of course, porting it would take quite a lot of effort probably, nevermind deprecating the old interface so all the cruft can be removed, but one can dream.
Posted Jan 21, 2020 19:52 UTC (Tue)
by roc (subscriber, #30627)
[Link]
Posted Jan 21, 2020 19:58 UTC (Tue)
by roc (subscriber, #30627)
[Link] (2 responses)
I would really like the ability to hand-off ptrace control to other processes by passing them a pidfd, but that would require lots more work. Something like:
Posted Jan 21, 2020 22:52 UTC (Tue)
by Paf (subscriber, #91811)
[Link] (1 responses)
I’m not really familiar with that part of ptrace, but I have had to look at the signal handling dance and it is a *mess*. (Not incorrect in any way, just... messy)
Posted Jan 22, 2020 2:44 UTC (Wed)
by roc (subscriber, #30627)
[Link]
Posted Jan 21, 2020 21:12 UTC (Tue)
by mirabilos (subscriber, #84359)
[Link] (4 responses)
“which” is an int, sure, but the PID is of type pid_t, while pidfds as file descriptors are of type int.
I’d rather have it…
int process_madvise(int pidfd, pid_t pid, …
… and use “pid” iff pidfd == -1 (which is the usual closed/invalid fd number).
Posted Jan 22, 2020 12:37 UTC (Wed)
by NAR (subscriber, #1313)
[Link] (3 responses)
My idea was to use the flags argument to select between pid and pidfd - but I guess that flag should be the same as accepted by madvise. If only C had function overloading... But it doesn't so maybe bite the bullet and create two functions: madvise_by_pid and madvise_by_pidfd.
Posted Jan 22, 2020 15:20 UTC (Wed)
by mirabilos (subscriber, #84359)
[Link] (2 responses)
No, there’s no extra variable, this is just a parameter.
You’d either call it with…
process_madvise(pidfd, /* ignored */ 0, …)
… or with…
process_madvise(-1, pid, …)
… so this question never comes up.
> bugs where both the pidfd and pid would be set
This is a very common interface, and the answer is trivial, as I stated in the comment above: pid is used iff pidfd == -1 (meaning if it’s not -1, pid will be ignored). This is a basic standard technique.
Posted Jan 22, 2020 16:59 UTC (Wed)
by james (subscriber, #1325)
[Link] (1 responses)
Posted Jan 22, 2020 17:19 UTC (Wed)
by mirabilos (subscriber, #84359)
[Link]
It’s really common practice to do things this way (maybe you don’t know this as you seem to be a C++ programmer, but in the C/UNIX world, we do).
Take, for example, mmap, when called with MAP_ANONYMOUS in flags, ignores the fd argument instead of checking it.
Posted Jan 22, 2020 1:25 UTC (Wed)
by KaiRo (subscriber, #1987)
[Link] (2 responses)
Posted Jan 22, 2020 6:41 UTC (Wed)
by cyphar (subscriber, #110703)
[Link] (1 responses)
> In this moment the tracer knows everything about tracee state, and pidfd brackets pidfd_open() and close() around actual action look just stupid, and this is cpu time wasting.
Posted Feb 3, 2020 20:56 UTC (Mon)
by nix (subscriber, #2304)
[Link]
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
long process_syscall(int pidfd, long number, ...)
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
* Make sure pidfd_wait or whatever can read the special ptrace status events.
* When pidfds are used, break the "ptrace parent" relationship in the kernel so *any* process with a pidfd for the tracee can ptrace() it or get the ptrace status events. (I bet this is *really* hard.)
But it would make much-requested rr features, like the ability to start debugging an in-progress rr recording without interrupting it, much more tractable.
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
“which”?
“which”?
“which”?
Perhaps the call should return an error if the pid is non-zero and pidfd isn't -1.
“which”?
“which”?
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs
process_madvise(), pidfd capabilities, and the revenge of the PIDs