A new kernel polling interface
On January 4, Christoph Hellwig posted a new polling API based on the asynchronous I/O (AIO) mechanism. This may come as a surprise to some, since AIO is not the most loved of kernel interfaces and it tends not to get a lot of attention. AIO allows for the submission of I/O operations without waiting for their completion; that waiting can be done at some other time if need be. The kernel has had AIO support since the 2.5 days, but it has always been somewhat incomplete. Direct file I/O (the original use case) works well, as does network I/O. Many other types of I/O are not supported for asynchronous use, though; attempts to use the AIO interface with them will yield synchronous behavior. In a sense, polling is a natural addition to AIO; the whole point of polling is usually to avoid waiting for operations to complete.
The patches add a new command (IOCB_CMD_POLL) that can be passed in an I/O control block (IOCB) to io_submit() along with any of the usual POLL* flags describing the type of I/O that is desired — POLLIN for data available to read, for example. This command, like other AIO commands, will not (necessarily) complete before io_submit() returns. Instead, when the indicated file descriptor is ready for the requested type of I/O, a completion event will be queued. A subsequent call to io_getevents() (or the io_pgetevents() variant added by the patch set, which applies a caller-supplied signal mask for the duration of the call, in the manner of ppoll()) will return that event, and the calling application will know that it can perform I/O on the indicated file descriptor. AIO poll operations always operate in the "one-shot" mode; once a poll notification has been generated, a new IOCB_CMD_POLL IOCB must be submitted for that file descriptor if further notifications are needed.
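As a rough illustration of what user-space usage might look like, here is a minimal sketch built on the raw AIO system calls. The wait_readable() helper is invented for this example, the IOCB_CMD_POLL value is taken from the posted patches rather than a released header, and the placement of the event mask in the IOCB's aio_buf field is an assumption about the proposed ABI:

#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <poll.h>
#include <time.h>

/* Not yet in mainline headers; value taken from the posted patches. */
#ifndef IOCB_CMD_POLL
#define IOCB_CMD_POLL 5
#endif

/* Thin wrappers around the raw AIO system calls. */
static long io_setup(unsigned nr, aio_context_t *ctx)
{
	return syscall(SYS_io_setup, nr, ctx);
}
static long io_submit(aio_context_t ctx, long nr, struct iocb **iocbs)
{
	return syscall(SYS_io_submit, ctx, nr, iocbs);
}
static long io_getevents(aio_context_t ctx, long min_nr, long nr,
			 struct io_event *events, struct timespec *timeout)
{
	return syscall(SYS_io_getevents, ctx, min_nr, nr, events, timeout);
}

/*
 * Wait (once) until fd is readable, using an AIO poll operation.
 * The caller is assumed to have set up the context beforehand:
 *	aio_context_t ctx = 0;  io_setup(128, &ctx);
 */
static long wait_readable(aio_context_t ctx, int fd)
{
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };
	struct io_event ev;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_lio_opcode = IOCB_CMD_POLL;
	cb.aio_buf = POLLIN;		/* events of interest (assumed encoding) */

	if (io_submit(ctx, 1, cbs) != 1)
		return -1;
	/* One-shot: a single completion event arrives when fd becomes readable */
	if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
		return -1;
	return ev.res;			/* the ready mask (POLLIN here) */
}

Since poll operations are one-shot, a real event loop would resubmit the IOCB after handling the I/O on that file descriptor.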
Thus far, this interface sounds more difficult to use than the existing poll system calls. There is a payoff, though, that comes in the form of the AIO ring buffer. This poorly documented aspect of the AIO subsystem maps a circular buffer into the calling process's address space. That process can then consume notification events directly from the buffer rather than calling io_getevents(). Multiple notifications can be consumed without the need to enter the kernel at all, and polling for multiple file descriptors can be re-established with a single io_submit() call. The result, Hellwig said in the patch posting, is an up-to-10% improvement in the performance of the Seastar I/O framework. More recently, he noted that the improvement grows to 16% on kernels with page-table isolation turned on.
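For the curious, the direct-consumption path looks roughly like the following sketch. The struct aio_ring layout mirrors the kernel's internal definition (it is what libaio and QEMU's linux-aio.c rely on), but it is not a documented ABI, and reap_events() is an invented helper name:

#include <linux/aio_abi.h>

/* User-space view of the ring that io_setup() maps into the process. */
struct aio_ring {
	unsigned id;
	unsigned nr;			/* number of io_event slots */
	unsigned head;			/* next event to be consumed by user space */
	unsigned tail;			/* next slot to be filled by the kernel */
	unsigned magic;
	unsigned compat_features;
	unsigned incompat_features;
	unsigned header_length;
	struct io_event io_events[];
};

/* The aio_context_t returned by io_setup() is the address of that mapping,
 * so completion events can be consumed without entering the kernel. */
static int reap_events(aio_context_t ctx, struct io_event *out, int max)
{
	struct aio_ring *ring = (struct aio_ring *)ctx;
	unsigned head = ring->head, tail = ring->tail;
	int n = 0;

	if (ring->magic != 0xa10a10a1)	/* AIO_RING_MAGIC: fall back to io_getevents() */
		return -1;

	__sync_synchronize();		/* read events only after the indices */
	while (head != tail && n < max) {
		out[n++] = ring->io_events[head];
		head = (head + 1) % ring->nr;
	}
	ring->head = head;		/* publish what has been consumed */
	return n;
}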
Internally to the kernel, any device driver (or other subsystem that exports a file_operations structure) can support the new poll interface, but some small changes will be required. It is not, however, necessary to support (or even know about) AIO in general. In current kernels, the polling system calls are all supported by the poll() method in struct file_operations:
unsigned int (*poll) (struct file *file, struct poll_table_struct *table);
This function must perform two actions: setting up notifications for when the underlying file is ready for I/O, and returning the types of I/O that could be performed without blocking now. The first is done by adding one or more wait queues to the provided table; the driver will perform a wakeup call on one of those queues when the state of the device changes. The current readiness state is the return value from the poll() method itself.
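For a hypothetical character driver (the mydev names below are invented for illustration), the traditional method follows a familiar pattern:

#include <linux/fs.h>
#include <linux/poll.h>
#include <linux/wait.h>

struct mydev {				/* invented example device state */
	wait_queue_head_t wait;
	bool readable;
	bool writable;
};

static unsigned int mydev_poll(struct file *file, struct poll_table_struct *table)
{
	struct mydev *dev = file->private_data;
	unsigned int mask = 0;

	/* (1) register the wait queue that will be woken on state changes */
	poll_wait(file, &dev->wait, table);

	/* (2) report what could be done right now without blocking */
	if (dev->readable)
		mask |= POLLIN | POLLRDNORM;
	if (dev->writable)
		mask |= POLLOUT | POLLWRNORM;
	return mask;
}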
Supporting AIO-based polling requires splitting those two functions into separate file_operations methods. Thus, there are two new entries to that structure:
struct wait_queue_head *(*get_poll_head)(struct file *file, int mask);
int (*poll_mask) (struct file *file, int mask);
(The actual patches use the new typedef __poll_t for the mask, but that typedef isn't in the mainline kernel yet). The polling subsystem will call get_poll_head() to obtain a pointer to the wait queue that will be notified when the device's I/O readiness state changes; poll_mask() will be called to get the current readiness state. A driver that implements these two operations need not (and probably should not) retain its implementation of the older poll() interface.
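Under the new scheme, the same hypothetical driver might implement the pair along these lines (again a sketch, reusing the invented struct mydev from above and using int rather than __poll_t for the masks, as in the quoted prototypes):

static struct wait_queue_head *mydev_get_poll_head(struct file *file, int mask)
{
	struct mydev *dev = file->private_data;

	/* A single queue now covers all readiness notifications for this file */
	return &dev->wait;
}

static int mydev_poll_mask(struct file *file, int mask)
{
	struct mydev *dev = file->private_data;
	int ret = 0;

	if (dev->readable)
		ret |= POLLIN | POLLRDNORM;
	if (dev->writable)
		ret |= POLLOUT | POLLWRNORM;
	return ret & mask;		/* only the events the caller asked about */
}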
One potential limitation built into this API is that there can only be a
single wait queue that receives notifications for a given file.
The current interface, instead, allows multiple queues to be used, and a
number of drivers take advantage of that fact to use, for example,
different queues for read and write readiness. Contemporary wait queues
offer enough flexibility that the use of multiple queues should not be
necessary anymore. If a driver cannot be changed, Hellwig said, "the
driver just won't support aio poll".
There have not been a lot of comments in response to the patch posting so
far; many of the relevant developers have been preoccupied with other
issues in the last week. It is hard to argue with a 10% performance
improvement, though, so some form of this patch seems likely to get into
the mainline sooner or later — interested parties can keep checking the
mainline repository to see if it's there yet. Whether we'll see a fifth
polling interface added in the future is anybody's guess, though.
Index entries for this article:
    Kernel: Asynchronous I/O
    Kernel: poll()
Posted Jan 10, 2018 19:32 UTC (Wed) by stefanha (subscriber, #55072):
The Linux AIO ring buffer has been upstream since pre-git history. It is not a new attack surface and this patch series doesn't even appear to touch the ring buffer code.
Ring buffers are used across other security boundaries like cpu<->hardware and VM<->hypervisor so they can be implemented in a secure fashion. Just as with syscalls, it's important to copy in untrusted data before validating it. Ring buffer producers and consumers typically maintain their own state that is not accessible across the trust boundary. They publish their internal state (e.g. ring indices) to the ring and fetch the other side's state from the ring, but that is secure.
Posted Jan 10, 2018 10:33 UTC (Wed) by mezcalero (subscriber, #45103):
I think it would be quite good if kernel folks designing those interfaces would have a look at what userspace actually does with those APIs and then make things less awful to use, because quite frankly, all of select(), poll(), ppoll(), epoll are just plain terrible, just to different levels. Have a look at how glib or systemd's sd-event end up handling prioritization (or guaranteeing event ordering) or hooking up waitid() to event loops; it's terrible, the choices one has to make there. All the great optimizations that epoll supposedly permits, and that this aio stuff will permit too, are so entirely useless if in this iteration again it all comes crashing down because this only works in synthetic, very specific test cases, and not for any of the generic event loops that are used in userspace IRL.
I couldn't care less about yet another API for all of this, even if it reduces the number of syscalls in niche cases even further, if we can't get the basic stuff done properly before. I'd much rather have a safe childfd() concept, and guaranteed event ordering/prioritization in epoll, before anything else. Or just a usable inotify() or fanotify() would be great...
Lennart
Posted Jan 10, 2018 21:16 UTC (Wed) by Cyberax (✭ supporter ✭, #52523):
Additionally, the userspace API in Windows (overlapped I/O) is designed decidedly better than epoll.
Posted Feb 6, 2018 4:15 UTC (Tue) by fest3er (guest, #60379):
This could solve the problem of inotify-wait never exiting because it will never again write to its socket/pipe to a userspace program, and thus will never detect that the reader of the pipe has gone away; the only way to detect that the reader end of a pipe is gone is to write to the pipe. Proof? Hot-plug a drive. Run inotify-wait looking for that /dev node to be deleted, and pipe it to a shell script that waits for the drive to be unplugged. When you unplug the drive, inotify-wait tells the script which then continues its processing and exits. But the inotify-wait program sits there forever because the file it was watching for deletion has been deleted and can never be deleted again; and because it will never receive another notice of the file's deletion, it will never again write to the pipe and, thus, it won't detect that the shell script (the pipe's reader) is gone. It's a deficiency in Linux. (And no, if you re-connect the drive and unplug it again, the comatose inotify-wait wakes not, because the /dev node that instance of inotify-wait is watching no longer exists and will never again exist.)
Polling is always wasteful. Even when there're no other options. So have a thread that reads the fdnotify FD, a thread that reads the inotify FD, a thread that reads the eventloop pipe FD, a thread that handles timeouts. Have each feed the dispatcher. No more polling. Action occurs only when something happens. Data transfer on the fdnotify FD should be small: 32 bits for the FD and 32 bits for the reason.
Posted Jan 19, 2018 19:17 UTC (Fri) by davmac (guest, #114522):
I don't see a lot of benefit to moving prioritisation into the kernel. You have to pick a particular priority model (e.g. assign every event source a fixed numerical priority), and if userspace needs a more complex model (weighting based on latency, or whatever) then you're pretty much back to square one anyway.
As it is, userspace has to maintain a queue of events, and the kernel just provides the events that go into the queue (this is assuming of course that you even need priority levels for different event sources, and that's not always the case). That's a pain, but you generally don't have to actually do it yourself - by which I mean, there are plenty of event loop libraries now which handle this for you.
The notion that the kernel itself needs to provide an API that is straightforward and generally usable from applications is flawed. It's much better to have flexibility than a rigid policy in the kernel, when any inherent complexity can always be hidden behind a library layer.
OTOH I completely agree it would be nice if there was a decent, reliable, non-signal way of watching process status via a file descriptor rather than the mess of listening for signals.
Posted Jan 19, 2018 20:19 UTC (Fri) by excors (subscriber, #95769):
Perhaps userspace could provide an eBPF program that implements a partial order over events, and the kernel can use that to do a topological sort.
Posted Jan 19, 2018 19:03 UTC (Fri) by davmac (guest, #114522):
From the article:
> AIO poll operations always operate in the "one-shot" mode
While "one-shot" isn't precisely the same thing as edge-triggered, they can largely be used with similar effect. If you want edge triggering and you have one-shot, you can arm a level-triggered one-shot listener and it will fire either immediately on the next "up" edge. Your application is in control of the "down" edge (i.e. you read all the data from the socket until you receive EAGAIN) and if you re-arm after that point, you effectively get notified of the next "up" edge in the same way that edge-triggered notification would.
The main differences are that (a) you have to explicitly re-arm and (b) you won't get extra notifications if you happen to get two edges while processing (i.e. if you drain all data from the socket but more comes in before you do another read and notice that the buffer is empty). The (a) point is a down-side, but (b) is pretty much essential if you want to poll for events from multiple threads, since you can otherwise end up with more than one thread trying to service the same active connection.
Posted Aug 29, 2018 9:23 UTC (Wed) by farnz (subscriber, #17727):
You can set that up with level-triggered notifications; use a software mutex mechanism to stop things happening in parallel on one event source if necessary, then ask for a new notification immediately upon receiving one (or at a good point in your processing of incoming events), and you will be notified of new events that happen while you're processing the older ones.
This is basically how an IRQ controller that re-signals edge-triggered interrupts arriving while interrupts are blocked works - bear in mind that on some (older) IRQ controllers, an edge-triggered interrupt that arrived while interrupts were still blocked would simply be lost by the hardware due to the race between unblocking and the interrupt arriving.
Posted Jan 10, 2018 16:33 UTC (Wed) by corbet (editor, #1):
I forgot to mention in the article that early Red Hat kernels had this functionality with the same API, which means that the libaio library already has support for it.
Posted Jan 12, 2018 9:58 UTC (Fri) by kkourt (subscriber, #48092):
Any pointers on how this works?
The relevant thing I found was this: http://git.infradead.org/users/hch/libaio.git/blob/refs/h..., i.e., a way to check if the ring buffer is empty. The code here: http://git.infradead.org/users/hch/libaio.git/blob/refs/h... uses it to avoid the syscall if there is nothing in the queue. But it will still enter the kernel to consume events.
Posted Jan 15, 2018 9:11 UTC (Mon) by stefanha (subscriber, #55072):
https://git.qemu.org/?p=qemu.git;a=blob;f=block/linux-aio.c;...