Restartable sequences restarted

By Jonathan Corbet
August 24, 2016

"Restartable sequences" is starting to look a bit like one of those bright ideas that floats around on the kernel list for years, but which never quite seems to make it into the mainline. In this case, the idea was first proposed over one year ago without, yet, having made appreciable progress toward merging; activity on this patch set died down after a while. But development on restartable sequences has picked up again under a new developer who has come up with yet another API for the feature.

As has happened in the kernel, scalability pressures are driving some user-space applications toward the use of lockless algorithms. In kernel space, such algorithms tend to be based on either disabling preemption or retrying an operation after contention is detected. Disabling preemption in user space is not an option, so retries are the primary option remaining. That is where restartable sequences come in; they combine a kernel-facilitated mechanism for detecting possible contention with a means to quickly force a retry when contention happens.

The current version of restartable sequences, as posted by Mathieu Desnoyers, retains the core idea of its predecessors. A restartable sequence is based around a short segment of code; only the final instruction of that segment is allowed to have side effects visible outside of the current thread. There is also an abort sequence, called to clean up and retry should the thread be preempted while executing the sequence. The specifics have changed, though.

Code using restartable sequences needs to start with an rseq structure:

    struct rseq {
    	int32_t cpu_id;
	uint32_t event_counter;
        struct rseq_cs *rseq_cs;
    };

(The actual structure is a bit more complex; various architecture-specific details have been omitted here in the interest of readability.) The cpu_id field always contains the number of the CPU on which the thread is running; event_counter is incremented whenever the thread is preempted — but only if rseq_cs is not null. The purpose of rseq_cs will be discussed below.

This structure must be registered with the kernel before restartable sequences can be used; the operative system call is:

    int rseq(struct rseq *rseq, int flags);

Only one rseq structure can be registered at a time in any given thread, but that structure can be registered multiple times, and the kernel will keep track of how many registrations (and unregistrations) there have been. The flags argument must be zero when registering a new structure. Unregistration is done by passing a null pointer for the rseq structure; setting flags to RSEQ_FORCE_UNREGISTER will cause the immediate removal of the structure, even if it has been registered multiple times.

In the past there have been concerns about how the restartable sequences feature would work when there are multiple users within an application (libraries, for example) that do not know about each other. If those users fight over which rseq structure is used, there will be problems with this interface as well; if, instead, they can all agree on the same structure, all will be well. Restartable sequences must be simple, so it makes no sense for code running within one to call another function at all, much less one that would start its own sequence. So there can only be a single sequence running at any given time.

To ensure that all users share a single rseq structure, the documentation recommends that each user declare it as a weak symbol and name it __rseq_abi. The linker will then ensure that, if there are multiple declarations within a given program, they will all refer to the same structure.

The other half of the puzzle is the rseq_cs structure pointed to from within the rseq structure above. This structure looks like (again, with some simplification applied):

    struct rseq_cs {
        void *start_ip;
	void *post_commit_ip;
	void *abort_ip;
    };

This structure describes an actual critical section that runs in the restartable mode. Here, start_ip is the address of the first instruction in the section, and post_commit_ip is the first instruction beyond the end of the section; any code running between those two instructions is running within the critical section. The abort_ip pointer is the address of the cleanup code to be executed should the thread be preempted while executing within the section.

With those pieces, a restartable sequence is run using something like this sequence of steps (assuming that the rseq structure is already registered):

The event_counter field from the rseq structure is read and saved.
The rseq_cs pointer in the rseq structure is set to point to the rseq_cs structure describing the critical section to be executed.
The event_counter is read again and compared to the value read previously; if the values do not match, the rseq_cs field should be cleared and the process must be restarted from the beginning.
The critical section can now be executed. In most cases, only the final instruction in the critical section should have visible side effects.
The rseq_cs field should be set to NULL.

If execution makes it past the end of the section, then all is well. If, instead, the thread is preempted while running within the critical section, the kernel will cause it to jump to the abort_ip address. The code found there should clean up and prepare to retry.

In principle, that is all there is to it. In practice, applications using this feature must still include some assembly code to set up the various instruction pointers; there is some complexity involved in making it all work properly. Those interested in examples can have a look at the self-tests included with the patch and, in particular, the rather frightening assembly-in-CPP code found here and here.

There have not been many comments on the implementation this time around; it seems that, perhaps, things are finally getting to a point where the developers who are paying attention are reasonably happy. The next obstacle, though, may be Linus, who wants more evidence that this is a feature that will actually be used. Convincing him is likely to require demonstrating some real-world code that benefits from the feature and benchmarks to prove that it is all worthwhile. Since restartable sequences are said to have been in use in places like Google for some time, that proof should be possible to come by. If the developers involved follow through, perhaps this sequence of patches will not need to be restarted too many more times.

Index entries for this article
Kernel	Restartable sequences

Restartable sequences restarted... with "frightening code"

Posted Aug 25, 2016 2:55 UTC (Thu) by pr1268 (subscriber, #24648) [Link] (1 responses)

the rather frightening assembly-in-CPP code found here and here.

When our esteemed editor starts describing prospective kernel code as "frightening", then I think it's time to really be scared.

In other words, don't try the self-tests without bracing yourself for possible blindness when browsing the code. ;-)

Regardless, this looks like a useful feature. Hopefully it gains enough traction to make it in...

Restartable sequences restarted... with "frightening code"

Posted Aug 31, 2016 19:41 UTC (Wed) by nix (subscriber, #2304) [Link]

This is not kernel code. This is *userspace* code. i.e. this is something the *users* of this feature have to do (or, more likely, use this very code, migrated to a proper library, to do).

re-registration

Posted Aug 29, 2016 3:00 UTC (Mon) by lkurusa (guest, #97704) [Link] (1 responses)

What is the rationale behind letting people re-register their struct rseq? Isn't this something akin to recursive locks?

re-registration

Posted Aug 29, 2016 9:50 UTC (Mon) by corbet (editor, #1) [Link]

I guess I didn't make that clear enough. The point is to let unrelated libraries all use this facility without stepping on each other's toes. The kernel can only handle one such structure per thread, but, by allowing nested registration and through the use of the __rseq_abi trick, independent users can all get along.

Restartable sequences restarted

Posted Oct 19, 2023 11:23 UTC (Thu) by bpearlmutter (subscriber, #14693) [Link]

I'd be concerned that this kind of mechanism can be used to get inappropriate access to data, by monitoring the number of restarts while setting things up so data you shouldn't have access to alters the probability of getting restarted.