
Low-latency Ethernet device polling

By Jonathan Corbet
May 21, 2013
Linux is generally considered to have one of the most fully featured and fast networking stacks available. But there are always users who are not happy with what's available and who want to replace it with something more closely tuned for their specific needs. One such group consists of people with extreme low latency requirements, where each incoming packet must be responded to as quickly as possible. High-frequency trading systems fall into this category, but there are others as well. This class of user is sometimes tempted to short out the kernel's networking stack altogether in favor of a purely user-space (or purely hardware-based) implementation, but that has problems of its own. A relatively small patch to the networking subsystem might just be able to remove that temptation for at least some of these users.

Network interfaces, like most reasonable peripheral devices, are capable of interrupting the CPU whenever a packet arrives. But even a moderately busy interface can handle hundreds of thousands of packets per second; per-packet interrupts would quickly overwhelm the processor with interrupt-handling work, leaving little time for getting useful tasks done. So most interface drivers will disable the per-packet interrupt when the traffic level is high enough and, with cooperation from the core networking stack, occasionally poll the device for new packets. There are a number of advantages to doing things this way: vast numbers of interrupts can be avoided, incoming packets can be more efficiently processed in batches, and, if packets must be dropped in response to load, they can be discarded in the interface before they ever hit the network stack. Polling is thus a win for almost all situations where there is any significant amount of traffic at all.

Extreme low-latency users see things differently, though. The time between a packet's arrival and the next poll is just the sort of latency that they are trying to avoid. Re-enabling interrupts is not a workable solution, though; interrupts, too, are a source of latency. Thus the drive for user-space solutions where an application can simply poll the interface for new packets whenever it is prepared to handle new messages.

Eliezer Tamir has posted an alternative solution in the form of the low-latency Ethernet device polling patch set. With this patch, an application can enable polling for new packets directly in the device driver, with the result that those packets will quickly find their way into the network stack.

The patch adds a new member to the net_device_ops structure:

    int (*ndo_ll_poll)(struct napi_struct *dev);

This function should cause the driver to check the interface for new packets and flush them into the network stack if they exist; it should not block. The return value is the number of packets it pushed into the stack, or zero if no packets were available. Other return values include LL_FLUSH_BUSY, indicating that ongoing activity prevented the processing of packets (the inability to take a lock would be an example) or LL_FLUSH_FAILED, indicating some sort of error. The latter value will cause polling to stop; LL_FLUSH_BUSY, instead, appears to be entirely ignored.
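As a rough illustration of that contract, here is a userspace model of the return-value semantics. The fake_ring structure, fake_ll_poll(), and the specific LL_FLUSH_* values are invented for this sketch; they are not the kernel's definitions, only a picture of the behavior described above:

```c
#include <stdbool.h>

/* Stand-in status codes; the real values live in the patch set. */
#define LL_FLUSH_FAILED (-1)
#define LL_FLUSH_BUSY   (-2)

struct fake_ring {
    int  pending;   /* packets waiting in the RX ring */
    bool locked;    /* another poller already owns the ring */
    bool broken;    /* simulated hardware error */
};

/* Mirrors the ndo_ll_poll() semantics: flush any pending packets
   "into the stack" and return how many, or a status code; never block. */
static int fake_ll_poll(struct fake_ring *ring)
{
    if (ring->broken)
        return LL_FLUSH_FAILED;   /* error: polling will stop */
    if (ring->locked)
        return LL_FLUSH_BUSY;     /* couldn't take the lock */

    int flushed = ring->pending;  /* push packets into the stack */
    ring->pending = 0;
    return flushed;               /* zero if nothing was waiting */
}
```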

Within the networking stack, the ndo_ll_poll() function will be called whenever polling the interface seems like the right thing to do. One obvious case is in response to the poll() system call. Sockets marked as non-blocking will only poll once; otherwise polling will continue until some packets destined for the relevant socket find their way into the networking stack, up until the maximum time controlled by the ip_low_latency_poll sysctl knob. The default value for that knob is zero (meaning that the interface will only be polled once), but the "recommended value" is 50µs. The end result is that, if unprocessed packets exist when poll() is called (or arrive shortly thereafter), they will be flushed into the stack and made available immediately, with no need to wait for the stack itself to get around to polling the interface.

Another patch in the series adds another call site in the TCP code. If a read() is issued on an established TCP connection and no data is ready for return to user space, the driver will be polled to see if some data can be pushed into the system. So there is no need for a separate poll() call to get polling on a TCP socket.
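A toy model of that read-path hook: when the socket's receive queue is empty, poll the driver before reporting "no data". The toy_sock, toy_recv(), and demo_refill() names are invented for illustration and do not correspond to the kernel's data structures:

```c
struct toy_sock {
    int queued;                          /* data already in sk_receive_queue */
    int (*ll_poll)(struct toy_sock *);   /* driver hook; may refill queued */
};

/* Like the patched TCP read path: if nothing is queued, give the
   driver a chance to push packets in before we answer. */
static int toy_recv(struct toy_sock *sk)
{
    if (sk->queued == 0 && sk->ll_poll)
        sk->ll_poll(sk);    /* may flush packets into the queue */

    int n = sk->queued;
    sk->queued = 0;
    return n;
}

/* Demo "driver": polling always finds five bytes for this socket. */
static int demo_refill(struct toy_sock *sk)
{
    sk->queued = 5;
    return 5;
}
```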

This patch set makes polling easy to use by applications; once it is configured into the kernel, no application changes are needed at all. On the other hand, the lack of application control means that every poll() or TCP read() will go into the polling code and, potentially, busy-wait for as long as the ip_low_latency_poll knob allows. It is not hard to imagine that, on many latency-sensitive systems, the hard response-time requirements really only apply to some connections, while others have no such requirements. Polling on those less-stringent sockets could, conceivably, create new latency problems on the sockets that the user really cares about. So, while no reviewer has called for it yet, it would not be surprising to see the addition of a setsockopt() operation to enable or disable polling for specific sockets before this code is merged.

It almost certainly will be merged at some point; networking maintainer Dave Miller responded to an earlier posting with "I just wanted to say that I like this work a lot." There are still details to be worked out and, presumably, a few more rounds of review to be done, so low-latency sockets may not be ready for the 3.11 merge window. But it would be surprising if this work took much longer than that to get into the mainline kernel.

Index entries for this article: Kernel/Networking



Low-latency ethernet device polling

Posted May 23, 2013 12:27 UTC (Thu) by eliezert (subscriber, #35757) [Link]

"LL_FLUSH_BUSY, instead, appears to be entirely ignored."

The reason busy is ignored is that whoever has the lock is actively polling the device, so if they find something, this poller will see it on its sk->sk_receive_queue.

In essence, if we can't poll on the device queue, we poll on the sk queue.
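That fallback can be sketched as follows; dev_q, sk_q, and poll_somewhere() are names invented here to model the "device queue, else socket queue" logic being described:

```c
#include <stdbool.h>

struct dev_q { bool locked; int pending; };   /* device RX ring */
struct sk_q  { int pending; };                /* sk_receive_queue */

/* If the device lock is free, flush the ring into the socket queue
   ourselves; if not, whoever holds it is polling and will deliver
   packets there for us. Either way, the socket queue is what we read. */
static int poll_somewhere(struct dev_q *dev, struct sk_q *sk)
{
    if (!dev->locked) {
        int n = dev->pending;
        dev->pending = 0;
        sk->pending += n;     /* flushed into the socket queue */
    }
    return sk->pending;       /* poll the sk queue regardless */
}
```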

Low-latency network techniques

Posted May 23, 2013 14:28 UTC (Thu) by jhhaller (guest, #56103) [Link] (5 responses)

To follow the path of low latency Ethernet, user initiated device polling is a good start, but insufficient. If one looks at the Intel Data Plane Development Kit (summarized in a few presentations at the last few IDF), there are a number of other techniques required to minimize latency. It's also important for the packets to be stored in a transparent huge page, so that there are few to no instances where the required page table entries are not in the processor's PTE cache (especially the IO-MMU). Having the actual data accessible to the application without requiring a copy also reduces latency, which means going through the network stack won't work, but the kernel then has to treat the application as a trusted entity, at least as far as the buffer pool is concerned.

For the ultimate in low latency, other avenues need to be explored, as any interrupts on the processor core handling the network traffic will affect latency. The direct DMA to cache in Intel's Sandy Bridge is also valuable, as it avoids any need for the packet to be stored or retrieved from RAM.

Other hardware-oriented approaches can be taken, but they still need to have the same characteristics - no interrupts, data sent directly to the memory space of the consuming process (ideally directly into cache), huge pages used for the data. The hardware techniques may be more suitable if the application cannot be trusted.

One might question why Linux is even used at all for this type of application, but there is typically high-velocity data input alongside more complicated higher-level logic that takes advantage of all Linux has to offer. Public-facing services that have to withstand large DoS attacks are one example: the DoS traffic is discarded in the low-latency processing area, while the real traffic is processed elsewhere. The latency may not be as important for those traffic flows, but the volume of traffic requires low latency because there is no room to store all the packets if the latency is high.

Low-latency network techniques

Posted May 23, 2013 20:37 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

> To follow the path of low latency Ethernet, user initiated device polling is a good start, but insufficient. If one looks at the Intel Data Plane Development Kit (summarized in a few presentations at the last few IDF), there are a number of other techniques required to minimize latency.

The object is not to get the smallest possible latency (the way to do that is to not run Linux at all, but instead run your application on the bare metal).

The object is to improve the latency at an acceptable cost to the mainline. This will not make it suitable for all uses, but each improvement will 'fix' some portion of the uses that it was not suitable for.

This is like the argument over "real-time". For some uses, you need to guarantee that you will never have latency >1ms; for others, it's acceptable to have your latency under 100ms almost all the time and fail to meet this requirement once in a while.

It all depends on how common the 'failures' are and what the consequences of the failure are for your application.

"High Frequency Trading" doesn't require the minimum possible latency any more than it requires absolute security. I'm sure they lose some latency to implement security, and lose some security to minimize latency.

Low-latency network techniques

Posted May 29, 2013 8:26 UTC (Wed) by meuh (guest, #22042) [Link]

> I'm sure they lose some latency to implement security, and lose some security to minimize latency.

You mean they're trading latency for security and vice versa.

Low-latency network techniques

Posted May 28, 2013 16:50 UTC (Tue) by meuh (guest, #22042) [Link] (2 responses)

There's a thing called RDMA, related to, but not limited to, InfiniBand. It even works on top of Ethernet as RoCE; some have even seen it over TCP/IP as iWARP.

RDMA relies on registered memory pages shared by the application and the network adapter.

https://www.openfabrics.org/resources/document-downloads/...

http://thegeekinthecorner.wordpress.com/2013/02/02/rdma-t...

Low-latency network techniques

Posted May 29, 2013 7:14 UTC (Wed) by eliezert (subscriber, #35757) [Link] (1 responses)

RDMA is great, but:

- you need to redesign your application in order to use it (sometimes this is a good thing :)
- it is not compatible with regular IP, so you can't really use it on the internet.

Low-latency network techniques

Posted May 29, 2013 8:22 UTC (Wed) by meuh (guest, #22042) [Link]

You want low latency on the Internet? Funny you ;)

BTW, sometimes you don't need to redesign your application to benefit from an RDMA-enabled infrastructure (either InfiniBand, iWARP, or RoCE); there's a work in progress called 'rsocket'. See

http://thread.gmane.org/gmane.linux.drivers.rdma/11627
https://www.openfabrics.org/ofa-documents/doc_download/49...
http://linux.die.net/man/7/rsocket

(There was also a thing called the Sockets Direct Protocol, aka SDP, but rsocket is going to replace it for good.)


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds