
The block I/O latency controller

By Jonathan Corbet
July 5, 2018
Large data centers routinely use control groups to balance the use of the available computing resources among competing users. Block I/O bandwidth can be one of the most important resources for certain types of workloads, but the kernel's I/O controller is not a complete solution to the problem. The upcoming block I/O latency controller looks set to fill that gap in the near future, at least for some classes of users.

Modern block devices are fast, especially when solid-state storage devices are in use. But some workloads can be even faster when it comes to the generation of block I/O requests. If a device fails to keep up, the length of the request queue(s) will increase, as will the time it takes for any specific request to complete. The slowdown is unwelcome in almost any setting, but the corresponding increase in latency can be especially problematic for latency-sensitive workloads.

The kernel has a block I/O controller now, but it has a number of shortcomings. It regulates bandwidth usage, not latency; that can be good in settings where users are being charged for higher bandwidth limits, but it is less useful for workloads where latency matters. If some groups do not use their full bandwidth allocations, a block I/O device may go idle even though other groups, which have hit their limits, have outstanding I/O requests. The block I/O controller also depends heavily on the CFQ I/O scheduler and loses functionality in its absence. It doesn't work at all with multiqueue block devices — the type of devices most likely to be in use in settings where the I/O controller is needed.

The I/O latency controller, written by Josef Bacik, addresses these problems by regulating latency (instead of bandwidth) at a relatively low level in the block layer. When it is in use, each control group directory has an io.latency file that can be used to set the parameters for that group. One writes a line to that file following this pattern:

    major:minor target=target-time

Where major and minor identify the specific block device of interest, and target-time is the maximum latency that this group should experience (in milliseconds).
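As a concrete illustration, the following sketch writes such a line from user space. The mount point, group name, and device numbers are assumptions made for the example rather than anything specified by the patches, and the units follow the description above.

    import os

    # Assumptions for this sketch: cgroup v2 is mounted at /sys/fs/cgroup,
    # a group named "web" already exists, and the device of interest is
    # 8:0 (commonly the first SCSI/SATA disk).  None of these names come
    # from the patch set itself.
    CGROUP = "/sys/fs/cgroup/web"
    DEVICE = "8:0"      # major:minor of the block device
    TARGET_MS = 10      # maximum latency this group should see, as described above

    def set_io_latency(cgroup, device, target_ms):
        """Write a "major:minor target=N" line to the group's io.latency file."""
        with open(os.path.join(cgroup, "io.latency"), "w") as f:
            f.write(f"{device} target={target_ms}\n")

    set_io_latency(CGROUP, DEVICE, TARGET_MS)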

The controller tracks the actual latency seen by each group, using a relatively short (100ms) window. If a given group starts to miss its target, all other peer groups with larger targets are throttled to free up some bandwidth; the group with the tightest latency target is thus given the highest priority for access to the device. If all groups are meeting their targets, no throttling is done, so no bandwidth should go to waste if there is a need for it.
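As a toy restatement of that rule (this is not the kernel code, just the logic from the previous paragraph in Python), each evaluation below stands for one roughly 100ms sampling window:

    class Group:
        def __init__(self, name, target_ms):
            self.name = name
            self.target_ms = target_ms      # configured via io.latency
            self.observed_ms = 0.0          # latency seen during the window
            self.throttled = False

    def evaluate_window(groups):
        """Throttle any group with a looser target than a group missing its target."""
        missing = [g for g in groups if g.observed_ms > g.target_ms]
        for g in groups:
            g.throttled = any(m.target_ms < g.target_ms for m in missing)

    # "web" (5ms target) is missing its target, so the looser "batch" group
    # (50ms target) is throttled; if every group were meeting its target,
    # no group would be throttled.
    web, batch = Group("web", 5), Group("batch", 50)
    web.observed_ms, batch.observed_ms = 8.0, 20.0
    evaluate_window([web, batch])
    print(batch.throttled)      # True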

On its face, throttling block I/O seems like a straightforward task: if a process needs to be slowed down, simply don't dispatch as many of its requests to the device. But block I/O is a bit strange in that much of it is initiated outside of the context of the process that is ultimately responsible for its creation. One example is filesystem metadata I/O, which is generated by the filesystem itself at a time of its own choosing. Slowing down that I/O may interfere with the filesystem's ordering decisions and create locking problems — without slowing down the target process at all. I/O generated by swapping is another example; it is generated when the kernel needs to reclaim memory, which may not be when the process being swapped is actually running. Slowing down swap I/O will slow down the freeing of memory for other uses — not a particularly good idea when the system is short of memory.

Kernel developers who introduce that kind of behavior have a relatively high likelihood of needing to look for openings in the fast-food industry in the near future. So the latency controller does no such thing. It will delay I/O dispatch for I/O that is generated directly by a process running inside a control group that is to be throttled. So a process reading rapidly from a file may find that its reads start taking longer when throttling goes into effect, for example.

A different approach is needed for indirectly generated block I/O, though. In such cases, the latency controller will record the amount of needed delay in the control group itself. Whenever a process running within that group returns from a system call — a setting where it is known that no locks are held — that process will be put to sleep for a period to pay back some of that delay. The sleep period can be as long as 250ms in severe cases. If I/O traffic eases up and throttling is no longer necessary, any remaining delays will be forgotten.
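Purely as a sketch of that bookkeeping, with invented names and the real mechanism living in the kernel, the idea looks roughly like this:

    import time

    MAX_SLEEP_MS = 250          # upper bound on a single pay-back sleep

    class ThrottledGroup:
        def __init__(self):
            self.owed_ms = 0.0      # delay recorded against the control group

        def charge(self, delay_ms):
            """Record delay for I/O generated indirectly on the group's behalf."""
            self.owed_ms += delay_ms

        def on_syscall_return(self):
            """Pay back some of the debt at a point where no locks are held."""
            sleep_ms = min(self.owed_ms, MAX_SLEEP_MS)
            if sleep_ms > 0:
                time.sleep(sleep_ms / 1000.0)
                self.owed_ms -= sleep_ms

        def pressure_relieved(self):
            """If throttling is no longer needed, forget the remaining delay."""
            self.owed_ms = 0.0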

In the patch introducing the controller itself, Bacik notes that using it results in a slightly higher number of requests per second (RPS) overall, and a significant reduction in variability of RPS rates over time. There is another interesting result, in that this controller can help to protect the system against runaway processes:

Another test we run is a slow memory allocator in the unprotected group. Before this would eventually push us into swap and cause the whole box to die and not recover at all. With these patches we see slight RPS drops (usually 10-15%) before the memory consumer is properly killed and things recover within seconds.

The throttling, seemingly, slows the allocating process enough to allow the OOM killer to do its job before the system runs completely out of memory.

This patch set has been through six revisions as of this writing, with some significant changes in the implementation happening along the way. That work appears to be coming to a close, though. It earned the elusive Quacked-at-by tag from Andrew Morton, and block maintainer Jens Axboe has indicated that it has been applied for the 4.19 development cycle. So the latency for the delivery of the block I/O latency controller would appear to be three or four months at this point.

Index entries for this article
Kernel: Control groups/I/O bandwidth controllers



The block I/O latency controller

Posted Jul 6, 2018 7:01 UTC (Fri) by mjthayer (guest, #39183) (2 responses)

Is this work likely to have any effect on I/O latency outside of control group contexts? Especially when building VirtualBox I quite regularly get my system to a point where I no longer even know whether it is still responsive and recovering from a reboot is faster than finding out, as in

https://bugzilla.kernel.org/show_bug.cgi?id=48841

but recently much worse. (Disclaimer: I wasn't joking with "I no longer even know". So perhaps I am sometimes really triggering some strange hangs. Pretty sure it isn't hardware, as it has happened on too many different systems.)

The block I/O latency controller

Posted Jul 6, 2018 8:44 UTC (Fri) by arnd (subscriber, #8866) (1 response)

I think the answer to your problem is BFQ, not control groups.

The block I/O latency controller

Posted Jul 6, 2018 9:29 UTC (Fri) by mjthayer (guest, #39183)

Thank you.

The block I/O latency controller

Posted Jul 6, 2018 11:22 UTC (Fri) by smurf (subscriber, #17840)

> The throttling, seemingly, slows the allocating process enough to allow the OOM killer
> to do its job before the system runs completely out of memory.

Patches which fix seemingly-unrelated problems instead of introducing them are good.
More, please. ;-)

The block I/O latency controller

Posted Jul 6, 2018 13:23 UTC (Fri) by josefbacik (subscriber, #90083) (3 responses)

Sorry it's not entirely clear by my patches, but it isn't the kernel's OOM killer that kills the process, it's our OOMD service that monitors the pressure statistics in the cgroups (the pressure patches aren't upstream yet, as soon as they are I'll hook in the relevant stuff for blk-iolatency). The in-kernel OOM killer is pretty terrible at knowing who to kill, which is why we get that stupid death spiral when we don't have cgroups in place to limit things. OOMD is able to make smarter decisions based on the pressure numbers to know who's causing problems and kills the memory consumer when things get out of hand.

The block I/O latency controller

Posted Jul 7, 2018 14:40 UTC (Sat) by jhoblitt (subscriber, #77733) (2 responses)

Googling "OOMD service linux" turns up the parent comment in the first page of results. Perhaps an LWN article on the OOMD would be well received?

The block I/O latency controller

Posted Jul 7, 2018 15:11 UTC (Sat) by corbet (editor, #1)

OOMD isn't released anywhere, as far as I know. Johannes's pressure stall patches have been on my radar for a bit; I'll definitely write about those, but the current set is a bit old, so I've been waiting for a repost.

The block I/O latency controller

Posted Jul 9, 2018 11:46 UTC (Mon) by josefbacik (subscriber, #90083)

OOMD isn't open sourced yet, still waiting on the pressure patches to land upstream. But it basically just watches the pressure statistics, waits for some threshold to be hit, murders anything that hits that threshold. In our reproducer you'll see that the memory bomb application is spending more and more time waiting on memory, and when it passes the 80% mark it'll be killed.
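A minimal monitor in the spirit of what is described here might look like the sketch below. It is purely illustrative: OOMD itself was not public at the time, the memory.pressure file format assumed is the one that eventually landed upstream, and the group path, threshold handling, and polling interval are invented for the example.

    import os, signal, time

    CGROUP = "/sys/fs/cgroup/workload"      # hypothetical group to watch
    THRESHOLD = 80.0                        # percent of time stalled, per the comment above

    def full_pressure_avg10(cgroup):
        """Return the 10-second "full" memory-pressure average for the group."""
        with open(os.path.join(cgroup, "memory.pressure")) as f:
            for line in f:
                if line.startswith("full"):
                    fields = dict(kv.split("=") for kv in line.split()[1:])
                    return float(fields["avg10"])
        return 0.0

    def kill_group(cgroup):
        """Kill every task in the group once the threshold has been crossed."""
        with open(os.path.join(cgroup, "cgroup.procs")) as f:
            for pid in f:
                os.kill(int(pid), signal.SIGKILL)

    while True:
        if full_pressure_avg10(CGROUP) > THRESHOLD:
            kill_group(CGROUP)
        time.sleep(1)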

The block I/O latency controller

Posted Jul 14, 2018 17:56 UTC (Sat) by ljishen (guest, #119563)

This helps to configure an I/O performance guarantee for some workloads.


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds