Zero-copy TCP receive
Packet reception starts in the kernel with the allocation of a series of buffers to hold packets as they come out of the network interface. As a general rule, the kernel has no idea what will show up next from the interface, so it cannot know in advance who the intended recipient of the next packet to arrive in a given buffer will be. An implementation of zero-copy reception will thus have to map these packet buffers into user-space memory after the packets come in and are associated with an open socket.
That, in turn, implies a set of constraints that must be met. Memory can only be mapped into a process's address space at page granularity; there is no way to map a fraction of a page. So inbound network data must be both page-aligned and page-sized when it ends up in the receive buffer, or it will not be possible to map it into user space. Alignment can be a bit tricky because the packets coming out of the interface start with the protocol headers, not the data the receiving process is interested in. It is the data that must be aligned, not the headers. Achieving this alignment is possible, but it requires cooperation from the network interface; in particular, it is necessary to use a network interface that is capable of splitting the packet headers into a different buffer as the packet comes in.
It is also necessary to ensure that the data arrives in chunks that are a multiple of the system's page size, or partial pages of data will result. That can be done by setting the maximum transfer unit (MTU) size properly on the interface. That, in turn, can require knowledge of exactly what the incoming packets will look like; in a test program posted with the patch set, Dumazet sets the MTU to 61,512. That turns out to be space for fifteen 4096-byte pages of data, plus 40 bytes for the IPv6 header and 32 bytes for the TCP header.
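To make the arithmetic concrete, here is a minimal sketch of setting such an MTU with the standard SIOCSIFMTU ioctl(); the interface name and the bare-bones error handling are placeholders. The 32 bytes of TCP header correspond to the 20-byte base header plus the twelve-byte timestamp option.

    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <unistd.h>

    int main(void)
    {
        struct ifreq ifr;
        int fd = socket(AF_INET6, SOCK_DGRAM, 0);  /* any socket will do for ioctl() */

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* placeholder interface name */
        /* Fifteen 4096-byte pages of payload, plus the IPv6 and TCP headers. */
        ifr.ifr_mtu = 15 * 4096 + 40 + 32;            /* = 61512 */

        if (ioctl(fd, SIOCSIFMTU, &ifr) < 0)
            perror("SIOCSIFMTU");
        close(fd);
        return 0;
    }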
The core of Dumazet's patch set is the implementation of mmap() for TCP sockets. Normally, using mmap() on something other than an ordinary file creates a range of address space that can be used for purposes like communicating with a device. When it is called on a TCP socket, though, the behavior is a bit different. If the conditions are met (the next incoming data chunk is page-sized and page-aligned), the buffer(s) containing that data will be mapped into the calling process's address space, where it can be accessed directly. This operation also has the effect of consuming the incoming data, much as if it had been obtained with recvmsg() instead. That is, needless to say, an unusual side effect from an mmap() call.
When the incoming data has been processed, the process should call munmap() to release the pages and free the buffer for another incoming packet.
If things are not just right (there is only a partial page of data available, for example, or that data is not page-aligned), the mmap() call will fail, returning EINVAL. That will also happen if there is urgent data in the pipeline. In such cases, the call does not consume the data, and the application must fall back to recvmsg() to obtain it.
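Putting the pieces together, a receive loop using this interface might look like the sketch below. It is based only on the behavior described above; the chunk size, the helper name, and the fallback logic are illustrative and not taken from the patch set.

    #include <errno.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define CHUNK (15 * 4096)   /* the payload carried by one full-MTU packet */

    /* Consume one chunk from a connected TCP socket, preferring the
     * zero-copy mmap() path and falling back to recvmsg() on EINVAL. */
    static ssize_t receive_chunk(int sock, char *copy_buf)
    {
        void *p = mmap(NULL, CHUNK, PROT_READ, MAP_SHARED, sock, 0);

        if (p != MAP_FAILED) {
            /* ... process CHUNK bytes at p, directly in the kernel's buffers ... */
            munmap(p, CHUNK);   /* release the pages for another packet */
            return CHUNK;
        }
        if (errno != EINVAL)
            return -1;          /* a real error, not just unaligned data */

        /* Partial, misaligned, or urgent data: fall back to copying. */
        struct iovec iov = { .iov_base = copy_buf, .iov_len = CHUNK };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
        return recvmsg(sock, &msg, 0);
    }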
It has long been conventional wisdom in the kernel community that zero-copy schemes dependent on memory-mapping tricks will struggle to outperform implementations that simply copy the data. There is quite a bit of overhead involved in setting up and tearing down these mappings. Indeed, Dumazet cautioned in the patch introduction that there may not be a benefit if the application uses a lot of threads, since the contention for the mmap_sem lock will become too expensive. But it is still natural to wonder if performing zero-copy packet reception in this way is worth the trouble.
One way of reducing the cost would be to not call mmap() until several pages of data are available to be consumed, so that they can all be mapped in a single batch. The network stack provides a way to request that the application not be notified until a certain amount of data is pending, in the form of the SO_RCVLOWAT option. That said, the socket() man page cautions that select() and poll() do not honor SO_RCVLOWAT on Linux, and will mark a socket readable when even a single byte of data is available. That shortcoming would make SO_RCVLOWAT useless for this purpose. The problem appears to have been fixed in 2008 for the 2.6.28 kernel, though, so the man page is a bit behind the times. Even so, there were still some shortcomings with SO_RCVLOWAT, including spurious wakeups, that Dumazet fixed as a part of this series.
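Requesting that batching from user space is a single setsockopt() call; the helper below is a hypothetical sketch combining SO_RCVLOWAT with the fifteen-page chunks used in Dumazet's test program.

    #include <sys/socket.h>

    /* Illustrative: with the fixed SO_RCVLOWAT, poll() and select() will
     * not mark the socket readable until a full chunk is queued. */
    static int set_low_watermark(int sock)
    {
        int lowat = 15 * 4096;
        return setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat));
    }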
In some benchmark results posted with the core patch, Dumazet shows some impressive improvements in packet-processing performance, from 129µs/MB to just 45µs/MB. Naturally, this is a tuned test running in a controlled setting, but it shows that there are indeed benefits to be had. Those benefits will be generally available before too long; networking maintainer Dave Miller has applied the series for the 4.18 merge window.
Comments

ebiederm: Those, I believe, implement a shared ring buffer between kernel and user space; something like that might be possible. TCP is a little different, though, as you are seeing the abstraction. With TCP segmentation offload, or its ingress equivalent, I think it is very likely that you will get the kind of packets needed in this case. Doing anything more complicated (i.e. a ring buffer) would, I suspect, be quite a bit harder to implement and more fragile than what has been implemented here, as this sounds like it is just taking packets right out of the existing packet queue.
quotemstr: Another possibility is just mapping the entire packet into the user-mode ring buffer and letting userspace skip over the embedded protocol headers, sort of as a hybrid between a conventional network stack and a user-space network stack.
cladisch: The FireWire driver does this for isochronous packets.

> and letting userspace skip over the embedded protocol headers

The FireWire host-controller interface is standardized and must support scatter/gather, so the driver can instruct it to write the header words into another buffer, leaving only the actual data bytes in the mmap() buffer. This requires that the header size be fixed and 32-bit aligned, so doing the same for a TCP/IP interface would require more flexible hardware support. There is also a mode that dumps everything into the buffer, where the application has to parse out the packet metadata and headers.