
io_uring, SCM_RIGHTS, and reference-count cycles

By Jonathan Corbet
February 13, 2019
The io_uring mechanism that was described here in January has been through a number of revisions since then; those changes have generally been fixing implementation issues rather than changing the user-space API. In particular, this patch set seems to have received more than the usual amount of security-related review, which can only be a good thing. Security concerns became a bit of an obstacle for io_uring, though, when virtual filesystem (VFS) maintainer Al Viro threatened to veto the merging of the whole thing. It turns out that there were some reference-counting issues that required his unique experience to straighten out.

The VFS layer is a complicated beast; it must manage the complexities of the filesystem namespace in a way that provides the highest possible performance while maintaining security and correctness. Achieving that requires making use of almost all of the locking and concurrency-management mechanisms that the kernel offers, plus a couple more implemented internally. It is fair to say that the number of kernel developers who thoroughly understand how it works is extremely small; indeed, sometimes it seems like Viro is the only one with the full picture.

In keeping with time-honored kernel tradition, little of this complexity is documented, so when Viro gets a moment to write down how some of it works, it's worth paying attention. In a long "brain dump", Viro described how file reference counts are managed, how reference-count cycles can come about, and what the kernel does to break them. For those with the time to beat their brains against it for a while, Viro's explanation (along with a few corrections) is well worth reading. For the rest of us, a lighter version follows.

Reference counts for file structures

The Linux kernel uses the file structure to represent an open file. Every open file descriptor in user space is represented by a file structure in the kernel; in essence, a file descriptor is an index into a table in struct files_struct, where a pointer to the file structure can be found. There is a fair amount of information kept in the file structure, including the current position within the file, the access mode, the file_operations structure, a private_data pointer for use by lower-level code, and more.

Like many kernel data structures, file structures can have multiple references to them outstanding at any given time. As a simple example, passing a file descriptor to dup() will allocate a second file descriptor referring to the same file structure; many other examples exist. The kernel must keep track of these references to be able to know when any given file structure is no longer used and can be freed; that is done using the f_count field. Whenever a reference is created, by calling dup(), forking the process, starting an I/O operation, or any of a number of other ways, f_count must be increased. When a reference is removed, via a call to close() or exit(), for example, f_count is decreased; when it reaches zero, the structure can be freed.
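
The sharing that dup() sets up can be observed from user space. The following is a minimal sketch, assuming a POSIX system; the function name dup_shares_offset() is made up for this illustration. It writes through one descriptor and reads the file position through the other, which works because the position is stored in the shared file structure rather than in the descriptor.

```python
import os
import tempfile

def dup_shares_offset():
    """Show that dup() yields a second descriptor for the same open file."""
    with tempfile.TemporaryFile() as tmp:
        fd1 = tmp.fileno()
        fd2 = os.dup(fd1)            # second reference to the same file structure
        try:
            os.write(fd1, b"hello")  # advances the shared file position to 5
            # The position lives in the file structure, so fd2 sees it too.
            pos_via_fd2 = os.lseek(fd2, 0, os.SEEK_CUR)
        finally:
            os.close(fd2)            # drops one reference; fd1 stays valid
        size_via_fd1 = os.fstat(fd1).st_size
    return pos_via_fd2, size_via_fd1
```

Closing fd2 only drops one reference; the file stays open until the last descriptor (here fd1, closed at the end of the with block) goes away.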

Various operations within the kernel can create references to file structures; for example, a read() call will hold a reference for the duration of the operation to keep the file structure in existence. Mounting a filesystem contained within a file via the loopback device will create a reference that persists until the filesystem is unmounted again. One important point, though, is that references to file structures are not, directly or indirectly, contained within file structures themselves. That means that any given chain of references cannot be cyclical, which is a good thing. Cycles are the bane of reference-counting schemes; once one is created, none of the objects contained within the cycle will ever see their reference count return to zero without some sort of external intervention. That will prevent those objects from ever being freed.
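
Why cycles defeat reference counting can be shown with a toy model. This is only a sketch: the Counted class and its get() and put() methods are invented for the illustration and do not correspond to the kernel's actual f_count handling.

```python
class Counted:
    """Toy reference-counted object; not how the kernel implements f_count."""

    def __init__(self, name):
        self.name = name
        self.count = 1       # the creator holds the first reference
        self.holds = []      # objects this one holds references to
        self.freed = False

    def get(self):
        """Take a new reference."""
        self.count += 1
        return self

    def put(self):
        """Drop a reference; free the object when the count reaches zero."""
        self.count -= 1
        if self.count == 0:
            self.freed = True
            for other in self.holds:   # freeing releases held references
                other.put()
            self.holds = []

def demo_chain():
    a, b = Counted("a"), Counted("b")
    a.holds.append(b.get())   # a -> b only, no cycle
    a.put()                   # frees a, which drops its reference to b
    b.put()                   # frees b
    return (a.count, a.freed), (b.count, b.freed)

def demo_cycle():
    a, b = Counted("a"), Counted("b")
    a.holds.append(b.get())   # a -> b
    b.holds.append(a.get())   # b -> a, closing the cycle
    a.put()                   # the creator drops both of its references...
    b.put()
    return (a.count, a.freed), (b.count, b.freed)   # ...but neither is freed
```

In the chain case both counts reach zero; in the cycle case each object is stuck at a count of one, held only by the other.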

Enter SCM_RIGHTS

Unfortunately for those of us living in the real world, the situation is not actually as simple as portrayed above. There are indeed cases where cycles of references to file structures can be created, preventing those structures from being freed. This is highly unlikely to happen in the normal operation of the system, but it is something that could be done by a hostile application, so the kernel must be prepared for it.

Unix-domain sockets are used for communication between processes running on the same system; they behave much like pipes, but with some significant differences. One of those is that they support the SCM_RIGHTS control message, which can be used to transmit an open file descriptor from one process to another. This feature is often used to implement request-dispatching systems or security boundaries; one process has the ability to open a given file (or network socket) and make decisions on whether another process should get access to the result. If so, SCM_RIGHTS can be used to create a copy of the file descriptor and pass it to the other end of the Unix-domain connection.

SCM_RIGHTS will obviously create a new reference to the file structure behind the descriptor being passed. This is done when the sendmsg() call is made, and a structure containing pointers to the file structure being passed is attached to the receiving end of the socket. This allows the passing side to immediately close its file descriptor after passing it with SCM_RIGHTS; the reference taken when the operation is queued will keep the file open for as long as it takes the receiving end to accept the new file and take ownership of the reference. Indeed, the receiving side need not have even accepted the connection on the socket yet; the kernel will stash the file structure in a queue and wait until the receiver gets around to asking for it.
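
The behavior described above, where the sender closes its descriptor right after passing it, can be sketched from user space as follows. This assumes a Linux or other Unix system with SCM_RIGHTS support; the function name pass_fd_and_close() is hypothetical, and a pipe stands in for whatever file a real application would pass.

```python
import array
import os
import socket

def pass_fd_and_close():
    """Send a pipe's read end over a Unix socket, close it, then receive it."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()
    try:
        os.write(w, b"still here")
        # Queue the descriptor as SCM_RIGHTS ancillary data on the socket.
        fds = array.array("i", [r])
        left.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
        os.close(r)   # the queued message now holds the only reference...

        # ...until the receiver picks it up and gets a new descriptor.
        space = socket.CMSG_LEN(fds.itemsize)
        _, ancdata, _, _ = right.recvmsg(1, space)
        level, ctype, data = ancdata[0]
        assert (level, ctype) == (socket.SOL_SOCKET, socket.SCM_RIGHTS)
        received = array.array("i")
        received.frombytes(data[:fds.itemsize])
        new_fd = received[0]
        try:
            return os.read(new_fd, 100)
        finally:
            os.close(new_fd)
    finally:
        os.close(w)
        left.close()
        right.close()
```

The data written before the transfer is still readable through the received descriptor, even though the sender's descriptor was closed while the message sat in the queue.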

Queuing SCM_RIGHTS messages in this way makes things work the way application developers would expect, but it has an interesting side effect: it creates an indirect reference from one file structure to another. The file structure representing the receiving end of an SCM_RIGHTS message, in essence, owns a reference to the file structure transferred in that message until the application accepts it. That has some important implications.

Suppose some process connects to itself via a Unix-domain socket, so it has two file descriptors, call them FD1 and FD2, one corresponding to each end of the connection. It then proceeds to use SCM_RIGHTS to send FD1 to FD2 and the reverse; each file descriptor is sent to the opposite end. We now have a situation where the file structure at each end of the socket indirectly holds a reference to the other — a cycle, in other words. This can work just fine; if the process then accepts the file descriptor sent to either end (or both), the cycle will be broken and all will be well.

If, however, the process closes FD1 and FD2 without accepting the transferred file descriptors, it will remove the only two references to the underlying file structures, except for those that make up the cycle itself. Those file structures will have a permanently elevated reference count and can never be freed. If this happens once as the result of an application bug, there is no great harm done; a small amount of kernel memory will be leaked. If a hostile process does it repeatedly, though, those cycles could eventually consume a great deal of memory.
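
The scenario can be reproduced in a few lines. The sketch below assumes a Linux system; send_fd(), recv_fd(), and make_cycle() are names invented here, and socketpair() is used to obtain the two connected ends. On a current kernel, running it is harmless because the cycle-breaking logic reclaims the sockets.

```python
import array
import os
import socket

INT_SIZE = array.array("i").itemsize

def send_fd(sock, fd):
    """Queue descriptor fd as an SCM_RIGHTS message for sock's peer."""
    payload = array.array("i", [fd])
    sock.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, payload)])

def recv_fd(sock):
    """Accept one queued descriptor, taking over the reference it carried."""
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(INT_SIZE))
    fds = array.array("i")
    fds.frombytes(ancdata[0][2][:INT_SIZE])
    return fds[0]

def make_cycle(accept_one=False):
    """Cross-send both ends of a socket pair, then close both descriptors.

    Returns the number of passed descriptors accepted before closing.
    """
    fd1, fd2 = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    send_fd(fd1, fd1.fileno())   # queued at fd2: fd2's file now references fd1's
    send_fd(fd2, fd2.fileno())   # queued at fd1: fd1's file now references fd2's
    accepted = 0
    if accept_one:
        # Consuming either message removes one edge and breaks the cycle.
        os.close(recv_fd(fd2))
        accepted = 1
    # With accept_one=False, the queued messages are now the only references.
    fd1.close()
    fd2.close()
    return accepted
```

Calling make_cycle() leaves the two sockets referenced only by each other's queued messages; make_cycle(accept_one=True) shows the benign case where one message is consumed first, so ordinary reference counting can free everything.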

There are other ways of using SCM_RIGHTS to create this kind of cycle as well. The problem always involves descriptor-passing datagrams that have never been received, though; this fact is used by the kernel to detect and break cycles. When a file structure corresponding to a Unix-domain socket gains a reference from an SCM_RIGHTS datagram, the inflight field of the corresponding unix_sock structure is incremented. If the reference count on the file structure is higher than the inflight count (which is the normal state of affairs), that file has external references and is thus not part of an unreachable cycle.

If, instead, the two counts are equal, that file structure might be part of an unreachable cycle. To determine whether that is the case, the kernel finds the set of all in-flight Unix-domain sockets for which all references are contained in SCM_RIGHTS datagrams (for which f_count and inflight are equal, in other words). It then counts how many references to each of those sockets come from SCM_RIGHTS datagrams attached to sockets in this set. Any socket that has references coming from outside the set is reachable and can be removed from the set. If it is reachable, and if there are any SCM_RIGHTS datagrams waiting to be consumed attached to it, the files contained within that datagram are also reachable and can be removed from the set.

At the end of an iterative process, the kernel may find itself with a set of in-flight Unix-domain sockets that are only referenced by unconsumed (and unconsumable) SCM_RIGHTS datagrams; at this point, it has a cycle of file structures holding the only references to each other. Removing those datagrams from the queue, releasing the references they hold, and discarding them will break the cycle.
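
The iterative process can be modeled in a few lines. This is a deliberately simplified sketch of the idea as described above, not the kernel's code: find_garbage() and its two dictionary arguments are invented for the illustration, and all locking is ignored.

```python
from collections import Counter

def find_garbage(f_count, queues):
    """Return the sockets referenced only by unconsumable SCM_RIGHTS messages.

    f_count: {socket: total reference count on its file structure}
    queues:  {socket: [sockets referenced by messages waiting in its queue]}
    """
    # inflight[s] = number of queued messages holding a reference to s.
    inflight = Counter(s for queued in queues.values() for s in queued)
    # Candidates: every reference comes from an in-flight message.
    candidates = {s for s, n in inflight.items() if f_count[s] == n}

    changed = True
    while changed:
        changed = False
        # Count only the references that originate inside the candidate set.
        internal = Counter(
            s for holder in candidates for s in queues.get(holder, [])
        )
        for s in list(candidates):
            if internal[s] < inflight[s]:
                # Some reference to s sits in the queue of a socket outside
                # the set, so s is reachable; its own queue stops counting
                # as "internal" on the next pass.
                candidates.discard(s)
                changed = True
    return candidates
```

For the two-socket cycle, find_garbage({"a": 1, "b": 1}, {"a": ["b"], "b": ["a"]}) returns both sockets. If a user-space descriptor still refers to one of them (f_count of 2 for "a"), nothing is returned, since "b" is reachable through the queue of "a".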

As one might imagine, given that the VFS is involved, there is more complexity than has been described above and some gnarly locking issues involved in carrying out these operations. See Viro's message for the gory details.

Fixing io_uring

Among the features provided by io_uring is the ability to "register" one or more files with an open ring; that speeds I/O operations by eliminating the need to acquire and release references to the registered files every time. When a file is registered with an io_uring, the kernel will create and hold a reference for the duration of that registration. This is a useful feature but it contained a problem that, seemingly, only somebody with a Viro-level understanding of the VFS could spot, describe, and fix; it is a new variant on the cycle problem described above. In short: a process could create a Unix-domain socket and register both ends with an io_uring. If it were then to pass the file descriptor corresponding to the io_uring itself over that socket, then close all of the file descriptors, a cycle would be created. The io_uring code was unprepared for that eventuality.

Viro proposed a solution that involves making the file registration mechanism set up the SCM_RIGHTS data structures as if the registered file descriptor were being passed over a Unix-domain socket. There is a useful analogy here; registering a file can be thought of as passing it to the kernel to be operated on directly. Once the setup has been done, the same cycle-breaking logic will find (and fix) cycles created using io_uring structures.

Jens Axboe, the author of io_uring, implemented the solution and verified that it works. With that issue resolved, it appears that the path to merging io_uring in the 5.1 development cycle may be clear. In the process, a bit of light has been shed on a corner of the VFS that few people understand. The problem of a lack of people with a wide understanding of the VFS layer as a whole, though, is likely to come up again; it rather looks like a cycle that we have not yet gotten out of.

Index entries for this article
Kernel: Asynchronous I/O
Kernel: Filesystems/Virtual filesystem layer
Kernel: io_uring



epoll

Posted Feb 13, 2019 17:14 UTC (Wed) by abatters (✭ supporter ✭, #6932) [Link] (2 responses)

The act of registering one file with another file reminded me of epoll. Out of curiosity, I went to look at how epoll handles this problem. The answer is that epoll doesn't increment the reference count when a file is added to an epoll set. Instead, epoll hooks into the file cleanup path to automatically remove a file from all epoll sets when its reference count drops to 0. See:

linux/fs/eventpoll.c::eventpoll_release_file()

epoll

Posted Feb 13, 2019 19:30 UTC (Wed) by axboe (subscriber, #904) [Link] (1 responses)

Which is a lot less elegant imho, and introduces extra conditionals in the code.

epoll

Posted Feb 15, 2019 15:12 UTC (Fri) by dw (subscriber, #12017) [Link]

But epoll's semantics are different; it's basically a "weak reference". Closing the FD causes the poller to unregister it, which makes sense, as the only alternative is epoll_wait() yielding events on a file for which no FD exists.

io_uring, SCM_RIGHTS, and reference-count cycles

Posted Feb 13, 2019 20:35 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> It then counts how many references to each of those sockets come from SCM_RIGHTS datagrams attached to sockets in this set. Any socket that has references coming from outside the set is reachable and can be removed from the set.
A tracing garbage collector, in other words.

io_uring, SCM_RIGHTS, and reference-count cycles

Posted Feb 13, 2019 23:46 UTC (Wed) by viro (subscriber, #7872) [Link] (6 responses)

FWIW, gnarly locking issues are mostly in socket-related stuff. I'd been reading through net/unix/*.c for the last week or so and it looks like the code didn't get a serious review (which, alas, pretty much has to involve people who are *not* intimately familiar with it - amazing how much crap gets caught by asking yourself "why is this done (at all|that way)?" and trying to figure it out) for quite a while ;-/

As for the documentation... TBH, I've lost count of how many times I'd sat down to put it together; the usual result is a series of tree-wide searches to verify the rules being described, followed by getting sidetracked to fix some bogosity caught by those. Sometimes in VFS proper, sometimes in filesystems, sometimes it's drivers or networking or ipc or kvm or... getting creative. With any luck the results do make the kernel better, but by the time the dust settles the original analysis needs to be re-verified (call graph changes, locking conditions at relevant call sites, etc.). Lather, rinse, repeat - usually it's 2-4 cycles a year ;-/

Result is an impressive pile of notes (coherent pieces of text interspersed with edited and annotated git grep output, call graphs, need-to-fix-that-bogosity-someday notes, CoC-violating rants, etc.)

The thing is, it's not just VFS - _some_ stuff got encapsulated sanely, but quite a bit of data structures are played with by very odd places in the kernel in very odd ways. For example, I hadn't been able to find anyone who would admit understanding arch/ia64/kernel/perfmon.c, and that thing used to play with struct file life cycle in extremely irregular ways - had been quite a thorn for more than a decade until it got disabled in Kconfig (and seeing that nobody has complained since then, it'll hopefully go away, and good riddance).

I don't know how to get from braindumps like that one to the set of coherent docs. Note that this one does not go into
* any kind of details on modifying descriptor tables and primitives for work with descriptors (iterating, etc.); relatively irrelevant for this thread, definitely needed in any documentation of descriptor tables.
* ->flush() method and notifying file of getting disconnected from descriptors (the only relevance to that thread would be "no, it's not usable for anything in this case - you'll keep getting false positives from hell every time something calls system(3)"; for any documentation of struct file life cycle it would obviously need to be included)
* struct file lifecycle (all that is covered is basically from successful open to final fput(); alloc_file_...() and friends are not covered at all and neither are the things _after_ the final fput())
* use of struct files * as opaque ID for POSIX locks/leases/etc. and related merry horrors in network/cluster filesystems (belongs in discussion of struct files lifetime and places where it can and cannot be poked in)
* RCU-related issues (fortunately, fairly self-contained area)
* lifecycle for unix_sock and related locking (I'd been nowhere near up-to-date on that; digging through this code proves to be... fruitful, as in "interesting bugs keep turning up", some in places like aushit). Again, it's a separate topic, but it *is* getting involved here, especially now that Jens is copying gobs of that stuff into his code; we'll need to turn that into a small set of well-defined primitives, or that will be a source of massive PITA for years to come.
* higher-level discussion of the nature of objects involved (descriptors vs. opened files vs. files being accessed) - that one I probably can fish out of the pile, remove the unprintable parts and turn into a coherent text, but that material is a lot better covered by various textbooks, so I decided to skip it.

So it was a mashup of at least three different pieces, with different level of details and rather uneven style; it's still useful as concentrated background information relevant to the problem at hand, but turning that into sane documentation is not an easy task ;-/ Taken together and turned into readable text it would grow into a counterpart of a couple of chapters in The Daemon Book. And that's a fairly small part of the interfaces - sure, it's the first one you get through on a lot of syscalls, but...

I'll be glad to assist with getting such docs done (supplying missing pieces, answering questions regarding the relationship between the topics involved, etc.), but I'm afraid that I'm not up to doing it all on my own. Another thing to keep in mind: quite a few things can change, quite possibly - as the direct result of trying to document the situation. Freezing the kernel interfaces while the description gets written is not going to happen - not for something with that wide a surface. Especially since all that stuff is reachable for sufficiently enterprising driver willing to poke its tender bits into machinery (and recreate the Modern Times scene with trip through the gears, often enough).

io_uring, SCM_RIGHTS, and reference-count cycles

Posted Feb 14, 2019 4:11 UTC (Thu) by unixbhaskar (guest, #44758) [Link]

Thanks a bunch, Al! Sometimes people need this kind of explanation to get a kick in the butt to do well. People (obviously including me!) don't know so many things, and your commentary did a lot of good in helping the wider audience understand the intricacies.

io_uring, SCM_RIGHTS, and reference-count cycles

Posted Feb 14, 2019 8:43 UTC (Thu) by Freeaqingme (subscriber, #103259) [Link] (3 responses)

Have you considered simply putting all those notes online with a huge disclaimer that it may very well be outdated at the moment of publishing? At least some parts would probably still be relevant, and it may help people see and understand why certain strategies were used/changed through the times.

As a bonus, someone may pick up on those notes and use them as a starting point to convert into perhaps more coherent, up2date, documentation.

io_uring, SCM_RIGHTS, and reference-count cycles

Posted Feb 14, 2019 10:32 UTC (Thu) by kay (subscriber, #1362) [Link] (2 responses)

corbet? ;)

io_uring, SCM_RIGHTS, and reference-count cycles

Posted Feb 15, 2019 13:40 UTC (Fri) by ermo (subscriber, #86690) [Link]

Yeah, maybe the Linux Foundation would be willing to sponsor some work by you and corbet, which could also become a series of articles here on LWN?

Everyone wins?

io_uring, SCM_RIGHTS, and reference-count cycles

Posted Feb 17, 2019 15:37 UTC (Sun) by andyc (subscriber, #1130) [Link]

io_uring, SCM_RIGHTS, and reference-count cycles

Posted Feb 14, 2019 18:31 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]

> I don't know how to get from braindumps like that one to the set of coherent docs.

I don't know for sure either, but I would bet a good starting place would be somebody (like the Linux Foundation) hiring a technical writer to do most of the work for you. Documentation will continue to lag behind code until somebody is willing to pay real money to get it done.

io_uring, SCM_RIGHTS, and reference-count cycles

Posted Feb 17, 2019 21:10 UTC (Sun) by rweikusat2 (subscriber, #117920) [Link]

Some minor additions: A process doesn't need to "connect to itself" to create an SCM_RIGHTS loop. Using two unconnected AF_UNIX sockets should work, too. There's also a socketpair system call which creates a pair of connected AF_UNIX sockets.

io_uring, SCM_RIGHTS, and reference-count cycles

Posted Feb 22, 2019 4:29 UTC (Fri) by scientes (guest, #83068) [Link]

There is some pretty atrocious code in systemd-journald (that I wrote) that uses proc to get the capabilities of the logging process for every logged message. My concern that this was too slow for the hot path was ignored and it was merged. It would be nice if SCM_RIGHTS or similar could allow removing this horrible code.

io_uring, SCM_RIGHTS, and reference-count cycles

Posted Feb 24, 2019 14:52 UTC (Sun) by Alex.C (guest, #130620) [Link]

For those interested in sample code, here is a short example: https://github.com/acassen/socket-takeover

The use case here was to provide a seamless takeover from one process to another for critical software upgrades (used for components on a mobile core network).


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds