Expediting membarrier()
membarrier()
membarrier() was first discussed in 2010. The initial use case was to support user-space RCU, which uses a shared-memory variable to indicate that a thread is running in an RCU critical section. Changes to RCU-protected objects (and, in particular, the freeing of the old version of a changed object) cannot happen while any thread is in an RCU critical section, so code that performs such an operation must check this shared flag to ensure that the change is safe. This scheme can be thwarted, though, if the processor reorders operations, causing the object to be freed before the variable is checked.
Processors provide memory-barrier instructions so that this kind of scenario can be prevented. Unfortunately, these instructions are relatively slow, since they must serialize access across the entire machine. Memory barriers must also occur in pairs to function properly; in this case, one barrier would be needed whenever a thread sets the "in RCU critical section" flag, while the other would happen after that flag is checked, but before any subsequent action is taken. This symmetric pairing of barriers works well in many situations, but it is poorly suited to the RCU use case in particular.
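The symmetric-barrier scheme described above can be sketched with C11 atomics; all names here are illustrative rather than taken from any real RCU implementation:

```c
/* Sketch of the symmetric-barrier scheme: the reader's fence and the
 * updater's fence must pair up for the flag check to be meaningful. */
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic int in_critical_section;   /* the shared flag */

/* Read side: runs on every rcu_read_lock()-like entry. */
static void reader_enter(void)
{
    atomic_store_explicit(&in_critical_section, 1, memory_order_relaxed);
    /* Barrier #1: the flag store must be visible before the reader
     * touches any protected object. */
    atomic_thread_fence(memory_order_seq_cst);
}

static void reader_exit(void)
{
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&in_critical_section, 0, memory_order_relaxed);
}

/* Update side: runs rarely, before freeing the old copy of an object. */
static bool safe_to_free(void)
{
    /* Barrier #2: pairs with the reader's fence, guaranteeing that a
     * reader which set the flag before this point is seen here. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&in_critical_section,
                                memory_order_relaxed) == 0;
}
```

The cost problem is visible in reader_enter(): the full fence executes on every entry into a critical section, whether or not an update is in flight.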
The problem comes from the fact that entry into an RCU critical section is a frequent occurrence, while changes to RCU-protected objects can be quite rare. So it is possible that hundreds (or more) rcu_read_lock() calls will be made where no thread is trying to change the protected objects; in such cases, all of the overhead incurred by those memory barriers is wasted. In situations where this sort of asymmetrical access pattern pertains, it would be worthwhile to greatly increase the cost of a memory-barrier operation — if that cost could be moved entirely to the thread performing the change, allowing the read path to be fast.
That is where membarrier() comes in. The initial version simply sent an inter-processor interrupt (IPI) causing every processor to execute a memory-barrier instruction. That approach was not entirely popular, since the IPIs wake every processor on the system and can cause unexpected latencies for realtime threads. Subsequent discussion caused the implementation to shift to calling synchronize_sched(), a kernel function that, among other things, ensures that every processor will have executed a memory barrier. At the time, the patches included an "expedited" option that would use IPIs instead, but when membarrier() was merged (many years later, in 2015), that option was not included.
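The asymmetric scheme that membarrier() makes possible might look like the following sketch, using the MEMBARRIER_CMD_QUERY and MEMBARRIER_CMD_SHARED commands from the merged system call (glibc provides no wrapper, so it is invoked via syscall(); the function names are illustrative):

```c
/* Sketch of the asymmetric scheme: the frequent read side pays only a
 * compiler barrier, while the rare update side calls membarrier() to
 * force a real memory barrier on every CPU. */
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static volatile int in_critical_section;

static long membarrier(int cmd, int flags)
{
    return syscall(__NR_membarrier, cmd, flags);
}

/* Read side: no barrier instruction at all. */
static void reader_enter(void)
{
    in_critical_section = 1;
    __asm__ __volatile__("" ::: "memory");  /* compiler barrier only */
}

/* Update side: rare and expensive; after this returns, every CPU has
 * executed a memory barrier, so the flag can be checked safely. */
static long updater_barrier(void)
{
    return membarrier(MEMBARRIER_CMD_SHARED, 0);
}
```

The entire cost of ordering has moved to updater_barrier(), which is exactly the trade-off the RCU use case wants.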
The expedited option
Recently, Paul McKenney posted a patch adding the expedited option back to membarrier(). This change raised some eyebrows, since the concerns about IPIs have not gone away. Mathieu Desnoyers, the original author of the membarrier() patch, asked how it was possible to offer the expedited option without impacting realtime processes, and Peter Zijlstra worried about the denial-of-service attack that can be carried out by code as simple as:
for (;;) membarrier(MEMBARRIER_CMD_SHARED_EXPEDITED, 0);
At the moment, it would seem, there are no new answers to any of those questions, but there is a stronger incentive to add the expedited option, and it appears that this option does not create any problems that do not already exist.
As McKenney described it, there are a number of users who are finding that the existing membarrier() system call is too slow. That is perhaps unsurprising; synchronize_sched() will force the calling thread to block until every CPU in the system goes through an RCU grace period, so there is a certain amount of latency built in. These users have found a trick to get the desired behavior without calling membarrier(): they make a call to either mprotect() or munmap() instead. Either of those system calls will, on an x86 system, cause an IPI to be issued to ensure that the affected address ranges are removed from each translation lookaside buffer (TLB). They also cause a certain amount of useless memory-management overhead but, evidently, the end result is still faster than calling membarrier().
Besides its fundamental inelegance, this approach has a couple of problems. One is that it could easily break in future kernels or on future hardware if those system calls can be made to work without IPIs; if such an optimization opportunity presents itself, the kernel developers are highly likely to take it. In fact, the IPIs are not necessary on all current hardware, leading McKenney to note that this trick "has the slight disadvantage of not working at all on arm and arm64". Adding the IPI capability to membarrier() will allow for better performance on all architectures without the need to resort to tricks.
Since users can already create IPIs at will with the memory-management calls, McKenney does not believe that adding that ability to membarrier() will make things worse. But there are, he said, a few things that could be done to reduce the potential for abuse of the expedited option. These range from complete "defanging" by disabling expedited grace periods at boot time to limiting the number of expedited membarrier() calls that can be made in a given time period. Various approaches to limiting the IPIs to the processors that actually need to receive them (those processors actually running threads from the application calling membarrier()) are also under consideration. Providing a mechanism for expedited barriers will, at least, give the kernel community the possibility of handling any abuse.
This is a patch that is likely to go through further revisions and discussion before it makes it close to the mainline. Among other things, the people who have been calling for a faster membarrier() need to verify that the expedited option solves their problem. "Obviously, unless there are good test results and some level of user enthusiasm, this patch goes nowhere", McKenney said. The actual code, at the moment, fits on a single screen; the discussion around it seems unlikely to be so concise.
Posted Jul 26, 2017 17:41 UTC (Wed) by josh (subscriber, #17465)

This seems like the ideal approach; interrupting a CPU that's already running application code seems entirely fine. The DoS concerns all relate to the ability to interrupt a CPU that *isn't* running application code.

Posted Jul 27, 2017 13:38 UTC (Thu) by corbet (editor, #1)

Remember, this is interrupting processors, not processes. There is an ongoing conversation about sending IPIs only to processors that might be running a process that has that memory mapped, but that appears to not be an entirely easy thing to do on all architectures.

Posted Jul 27, 2017 14:44 UTC (Thu) by ncm (guest, #165)

It takes so long to change mappings... maybe a list of processors that already have the page of interest mapped, that are favored to run processes that have used it lately? Maybe this sounds too much like Touching The Scheduler.