Zero-copy TCP receive
Packet reception starts in the kernel with the allocation of a series of buffers to hold packets as they come out of the network interface. As a general rule, the kernel has no idea what will show up next from the interface, so it cannot know in advance who the intended recipient of the next packet to arrive in a given buffer will be. An implementation of zero-copy reception will thus have to map these packet buffers into user-space memory after the packets come in and are associated with an open socket.
That, in turn, implies a set of constraints that must be met. Memory can only be mapped into a process's address space at page granularity; there is no way to map a fraction of a page. So inbound network data must be both page-aligned and page-sized when it ends up in the receive buffer, or it will not be possible to map it into user space. Alignment can be a bit tricky because the packets coming out of the interface start with the protocol headers, not the data the receiving process is interested in. It is the data that must be aligned, not the headers. Achieving this alignment is possible, but it requires cooperation from the network interface; in particular, it is necessary to use a network interface that is capable of splitting the packet headers into a different buffer as the packet comes in.
It is also necessary to ensure that the data arrives in chunks that are a multiple of the system's page size, or partial pages of data will result. That can be done by setting the maximum transfer unit (MTU) size properly on the interface. That, in turn, can require knowledge of exactly what the incoming packets will look like; in a test program posted with the patch set, Dumazet sets the MTU to 61,512. That turns out to be space for fifteen 4096-byte pages of data, plus 40 bytes for the IPv6 header and 32 bytes for the TCP header.
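To make the arithmetic concrete, here is a minimal sketch of setting such an MTU with the standard SIOCSIFMTU ioctl(); the interface name and the bare-bones error handling are placeholders. The 32 bytes of TCP header correspond to the 20-byte base header plus the twelve-byte timestamp option.

    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <unistd.h>

    int main(void)
    {
        struct ifreq ifr;
        int fd = socket(AF_INET6, SOCK_DGRAM, 0);  /* any socket will do for ioctl() */

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* placeholder interface name */
        /* Fifteen 4096-byte pages of payload, plus the IPv6 and TCP headers. */
        ifr.ifr_mtu = 15 * 4096 + 40 + 32;            /* = 61512 */

        if (ioctl(fd, SIOCSIFMTU, &ifr) < 0)
            perror("SIOCSIFMTU");
        close(fd);
        return 0;
    }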
The core of Dumazet's patch set is the implementation of mmap() for TCP sockets. Normally, using mmap() on something other than an ordinary file creates a range of address space that can be used for purposes like communicating with a device. When it is called on a TCP socket, though, the behavior is a bit different. If the conditions are met (the next incoming data chunk is page-sized and page-aligned), the buffer(s) containing that data will be mapped into the calling process's address space, where it can be accessed directly. This operation also has the effect of consuming the incoming data, much as if it had been obtained with recvmsg() instead. That is, needless to say, an unusual side effect from an mmap() call.
When the incoming data has been processed, the process should call munmap() to release the pages and free the buffer for another incoming packet.
If things are not just right (there is only a partial page of data available, for example, or that data is not page-aligned), the mmap() call will fail, returning EINVAL. That will also happen if there is urgent data in the pipeline. In such cases, the call does not consume the data, and the application must fall back to recvmsg() to obtain it.
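Putting the pieces together, a receive loop using this interface might look like the sketch below. It is based only on the behavior described above; the chunk size, the helper name, and the fallback logic are illustrative and not taken from the patch set.

    #include <errno.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define CHUNK (15 * 4096)   /* the payload carried by one full-MTU packet */

    /* Consume one chunk from a connected TCP socket, preferring the
     * zero-copy mmap() path and falling back to recvmsg() on EINVAL. */
    static ssize_t receive_chunk(int sock, char *copy_buf)
    {
        void *p = mmap(NULL, CHUNK, PROT_READ, MAP_SHARED, sock, 0);

        if (p != MAP_FAILED) {
            /* ... process CHUNK bytes at p, directly in the kernel's buffers ... */
            munmap(p, CHUNK);   /* release the pages for another packet */
            return CHUNK;
        }
        if (errno != EINVAL)
            return -1;          /* a real error, not just unaligned data */

        /* Partial, misaligned, or urgent data: fall back to copying. */
        struct iovec iov = { .iov_base = copy_buf, .iov_len = CHUNK };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
        return recvmsg(sock, &msg, 0);
    }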
It has long been conventional wisdom in the kernel community that zero-copy schemes dependent on memory-mapping tricks will struggle to outperform implementations that simply copy the data. There is quite a bit of overhead involved in setting up and tearing down these mappings. Indeed, Dumazet cautioned in the patch introduction that there may not be a benefit if the application uses a lot of threads, since the contention for the mmap_sem lock will become too expensive. But it is still natural to wonder if performing zero-copy packet reception in this way is worth the trouble.
One way of reducing the cost would be to not call mmap() until several pages of data are available to be consumed, so that they can all be mapped in a single batch. The network stack provides a way to request that the application not be notified until a certain amount of data is pending, in the form of the SO_RCVLOWAT option. That said, the socket() man page cautions that select() and poll() do not honor SO_RCVLOWAT on Linux, and will mark a socket readable when even a single byte of data is available. That shortcoming would make SO_RCVLOWAT useless for this purpose. The problem appears to have been fixed in 2008 for the 2.6.28 kernel, though, so the man page is a bit behind the times. Even so, there were still some shortcomings with SO_RCVLOWAT, including spurious wakeups, that Dumazet fixed as a part of this series.
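Requesting that batching from user space is a single setsockopt() call; the helper below is a hypothetical sketch combining SO_RCVLOWAT with the fifteen-page chunks used in Dumazet's test program.

    #include <sys/socket.h>

    /* Illustrative: with the fixed SO_RCVLOWAT, poll() and select() will
     * not mark the socket readable until a full chunk is queued. */
    static int set_low_watermark(int sock)
    {
        int lowat = 15 * 4096;
        return setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat));
    }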
In some benchmark results posted with the core patch, Dumazet shows some impressive improvements in packet-processing performance, from 129µs/MB to just 45µs/MB. Naturally, this is a tuned test running in a controlled setting, but it shows that there are indeed benefits to be had. Those benefits will be generally available before too long; networking maintainer Dave Miller has applied the series for the 4.18 merge window.
Comments

ebiederm: Those, I believe, implement a shared ring buffer between kernel and user space; something like that might be possible. TCP is a little different, though, as you are seeing the abstraction. With TCP segmentation offload, or its ingress equivalent, I think it is very likely that you will get the kind of packets needed in this case. Doing anything more complicated (i.e. a ring buffer) would, I suspect, be quite a bit harder to implement and more fragile than what has been implemented here, as this sounds like it is just taking packets right out of the existing packet queue.
quotemstr: Another possibility is just mapping the entire packet into the user-mode ring buffer and letting userspace skip over the embedded protocol headers, sort of as a hybrid between a conventional network stack and a user-space network stack.
cladisch: The FireWire driver does this for isochronous packets.

> and letting userspace skip over the embedded protocol headers

The FireWire host-controller interface is standardized and must support scatter/gather, so the driver can instruct it to write the header words into another buffer, leaving only the actual data bytes in the mmap() buffer. This requires that the header size be fixed and 32-bit aligned, so doing the same for a TCP/IP interface would require more flexible hardware support. There is also a mode that dumps everything into the buffer, where the application has to parse out the packet metadata and headers.