Deadline scheduling: coming soon?
To recap briefly: deadline scheduling does away with the concept of process priorities that has been at the core of most CPU scheduler algorithms. Instead, each process provides three parameters to the scheduler: a "worst-case execution time" describing the maximum amount of CPU time needed to accomplish its task, a period describing how often the task must be performed, and a deadline specifying when the task's work must be completed. The actual scheduling algorithm is then relatively simple: the task whose deadline is closest runs first. If the scheduler takes care not to allow the creation of deadline tasks when the sum of the worst-case execution times would exceed the available CPU time, it can guarantee that every task will be able to finish by its deadline.
Deadline scheduling is thus useful for realtime tasks, where completion by a deadline is a key requirement. It is also applicable to periodic tasks like streaming media processing.
In recent times, work on deadline scheduling has been done by Juri Lelli. He has posted several versions, improving things along the way. His v8 posting in October generated a fair amount of discussion, including suggestions from scheduler maintainers Peter Zijlstra and Ingo Molnar that the time had come to merge this code. That merging did not happen for 3.13, but chances are that it will for a near-future kernel release, barring some sort of unexpected roadblock. The main thing that remains to be done is to get the user-space ABI nailed down, since that aspect is hard to change after it has been released in a mainline kernel.
Controlling the scheduler
To be able to guarantee that deadlines will be met, a deadline scheduler must have an incontestable claim to the CPU, so deadline tasks will run ahead of all other tasks — even those in the realtime scheduler classes. Deadline-scheduled processes cannot take all of the available CPU time, though; the amount of time actually available is controlled by a set of sysctl knobs found under /proc/sys/kernel/. The first two already exist in current kernels: sched_rt_runtime_us and sched_rt_period_us. The first specifies the amount of CPU time (in microseconds) available to realtime tasks, while the second gives the period over which that CPU time is available. By default, 95% of the total CPU time is made available to realtime processes, leaving 5% to give a desperate system administrator a chance to recover a system from a runaway realtime process.
The new sched_dl_runtime_us knob is used to say how much of the realtime allocation is available for use by the deadline scheduler. The default setting allocates 40% for deadline scheduling, but a system administrator may well want to tweak that value. Note that, while realtime scheduling is supported by control groups, deadline scheduling has not yet been implemented at that level. How deadline scheduling should interact with group scheduling raises some interesting questions that have not yet been fully answered.
The other piece of the ABI allows processes to enter and control the deadline scheduling regime. The current system call for changing a process's scheduling class is:
int sched_setscheduler(pid_t pid, int policy, const struct sched_param *param);
The sched_param structure used in this system call is quite simple:
struct sched_param {
    int sched_priority;
};
So sched_setscheduler() works fine for the currently available scheduling classes; the desired class is specified with the policy parameter, while the associated process priority goes into param. But struct sched_param clearly does not have the space needed to hold the three parameters needed with deadline scheduling, and its definition cannot be changed without breaking the existing ABI. So a new system call will be needed. As of this writing the details are still under discussion, but the new ABI can be expected to look something like this:
struct sched_attr {
    int sched_priority;
    unsigned int sched_flags;
    u64 sched_runtime;
    u64 sched_deadline;
    u64 sched_period;
    u32 size;
};

int sched_setscheduler2(pid_t pid, int policy, const struct sched_attr *param);
int sched_setattr(pid_t pid, const struct sched_attr *param);
int sched_getattr(pid_t pid, struct sched_attr *param, unsigned int size);
Where size (as both a parameter and a structure field) is the size of the sched_attr structure. If, in the future, the need arises to add more fields to that structure, the kernel will be able to use the size value to determine which version of the structure an application is using and respond accordingly. For the curious: size is meant to be specified within struct sched_attr when that structure is, itself, an input parameter to the kernel; otherwise size is given separately. The sched_flags field of struct sched_attr is not used in the current version of the patch.
One other noteworthy detail is that processes running in the new SCHED_DEADLINE class are not allowed to fork children in the same class. As with the realtime scheduling classes, this restriction can be worked around by setting the scheduling class to SCHED_DEADLINE|SCHED_RESET_ON_FORK, which causes the child to be placed back into the default scheduling class. Without that flag, a call to fork() will fail.
Time to merge?
The deadline scheduling patch set has a number of loose ends left to be dealt with, many of which are indicated in the patches themselves. But there comes a point where it is best to go ahead and get the code into the mainline so that said loose ends can be tied down more quickly; the deadline scheduling patches may well have reached that point. Since deadline scheduling can be added without much risk of regressions on systems where it is not in use, there should not be a whole lot more that needs to be dealt with before it can be merged.
...except, maybe, for one little thing. When deadline scheduling was discussed at the 2010 Kernel Summit, Linus and others clearly worried that there may not be actual users for this functionality. There has not been a whole lot of effort put into demonstrating users for deadline scheduling since then, though it is worth noting that the Yocto project has included the patch in some of its kernels. The JUNIPER project is also planning to use deadline scheduling, and has been supporting its development. Users like these will definitely help the deadline scheduler's case; Linus has become wary of adding features that may not actually be used. If that little point can be adequately addressed, we may have deadline scheduling in the mainline in the near future.
Posted Dec 4, 2013 19:35 UTC (Wed)
by bokr (subscriber, #58369)
[Link] (7 responses)
2. I am wondering if complex policy and critical-path stuff wouldn't best be done in userland at high realtime priority, i.e., let such a process do the figuring and policy-implementation planning to meet deadlines, and give it an ABI to tell the OS what to run at absolute priority starting when. Then it can yield its own CPU by sleeping a planned amount, as an inverted scheduling of other work.
Posted Dec 4, 2013 23:27 UTC (Wed)
by smurf (subscriber, #17840)
[Link] (6 responses)
If you have infrequent events which you must react to _right_now_, you definitely need "hard" real time, yet a deadline scheduler is useless.
Posted Dec 5, 2013 4:55 UTC (Thu)
by tseaver (guest, #1544)
[Link] (5 responses)
applicable to your problem, you are doing something else.

[1] http://en.wikipedia.org/wiki/Hard_realtime#Hard
Posted Dec 5, 2013 10:38 UTC (Thu)
by kugel (subscriber, #70540)
[Link]
Posted Dec 5, 2013 11:16 UTC (Thu)
by yaap (subscriber, #71398)
[Link] (3 responses)
Also, even for the first case EDF is not a requirement. One can use a preemptive scheduler with rate-monotonic analysis (see https://en.wikipedia.org/wiki/Rate-monotonic_scheduling) to make sure deadlines are met. And from memory it seems this has been done: provide an EDF-like library API that checks that the RMA conditions are met, but just allocate a priority and run on a preemptive scheduler.
So although hard RT is about deadlines and EDF has the proper keyword in its name, it is never a necessity. That may explain why there is little traction.
Posted Dec 5, 2013 13:03 UTC (Thu)
by hasta2003 (subscriber, #76829)
[Link] (2 responses)
- it can provide temporal guarantees, which plain EDF or RM cannot do
- it provides temporal isolation, so misbehaving tasks cannot affect your high-priority activities
- it can better utilize CPU power (optimal on UP: 100% of CPU power while still meeting deadlines, vs. ~69% for RM)
The RT literature is quite vast in this regard; more can be found in the documentation (https://lkml.org/lkml/2013/11/7/271). Just let me know if you want more details :).
Thanks,
- Juri
Posted Dec 6, 2013 1:51 UTC (Fri)
by bokr (subscriber, #58369)
[Link] (1 responses)
2. Deadlines to me fall into categories that show up more easily in the concrete real time scheduling problem of cooking and serving a dinner. One deadline is when everyone is seated and baked potato and grilled steaks and steamed veggies (with a cold pat of butter on top) all have to be served near simultaneously. This entails other deadlines and responses to events. Deadline for turning on oven. Hot enough to put in potatoes. Etc.
The general pattern is resource available, lax schedule for some prepping, cooking for a fixed interval with strict deadline at the end, not for starting, so the start has to be computed back, but it is important to start accurately so it will end accurately. The potatoes take the longest, the steaks probably parallel to the broccoli. You get the picture.
Also note that the oven will be hot, so dessert souffle cook should take that into account. Worst case cooking time is not uncorrelated unless you have uncorrelated start conditions.
3. I can imagine a situation where you are supplying services with different prices according to quality. The cheapest might schedule single processors and simple low bitrate algorithms, and not meeting deadlines is just breakups in the phone conversation or frozen frames instead of keeping up with jump-cut keyframes, or whatever.
The question is if the scheduling ABI will support building a server that does both high and low QOS processing at the same time, using more CPUs and power for the high paying customer.
In terms of GUI experience, one frame is a dinner, broccoli might be a video overlay of stock ticker updated one marquee shift, etc. But everything has to be cooked and ready/consistent at buffer flip time.
4. Can the EDF be tiered into separate priorities, so that high priority tasks run first, with maybe leftover time trickling to lower levels?
5. Priority is not to be used to control logical sequence order, but sometimes a ready queue that is guaranteed to run in logical fifo/roundrobin sequence is cheaper than guaranteeing access coordination by locks. Maybe just use locks and not worry about it?
Posted Dec 6, 2013 10:03 UTC (Fri)
by hasta2003 (subscriber, #76829)
[Link]
Let's say that in your restaurant you want to cook several dinners at the same time, and you want to assign them different priorities (friends and restaurant reviewers have to eat first, and no more than 10 minutes after they ordered); that's group scheduling: all activities for a single dinner receive higher priority. The same could apply to virtual machines on a shared host. It is a feature we are working on that will soon extend SCHED_DEADLINE's capabilities.
SCHED_DEADLINE already allows sporadic activations: you can decide that your task is activated by the completion of some other task, or when a timer fires (put the broccoli in the pot after the potatoes have been in the oven for 10 minutes). You just assign deadlines to them, and they will be scheduled among the other tasks, with no worries about relative priorities (as you probably have to do with FIFO/RR; e.g., are my broccoli more important than the steak?).
Then you may also want to assemble a cake layer by layer. Base in 1 minute, put cream on top (after the base) in 30 seconds, and pass it to the next stop where it will be covered by chocolate frosting in not more than 15 seconds. That sounds to me like a pipeline of tasks, and such pipelines have already been studied. Different approaches are possible (holistic view, intermediate deadlines, etc.), and all of them can make use of SCHED_DEADLINE.
Lastly, let's say you want to just say "prepare a vegetarian couscous in not more than 30 minutes" and you don't want to care about the parallel steps. Someone has to cook the couscous, someone else has to cut the onions, carrots, zucchini, etc., and then someone has to cook them. At the end, everything has to be put together again in a single dish and served to the customer. There is a lot of interest in this right now (OpenMP, etc.), and how to efficiently assign deadlines to parallel tasks is not yet decided, but the building blocks are already available using the new scheduling policy.
After all, you can also still use different policies for different activities, and we are actually working on how to correctly and safely make priority trickle down (proxy execution being a promising candidate).
Posted Dec 4, 2013 19:53 UTC (Wed)
by jhhaller (guest, #56103)
[Link] (2 responses)
A more complicated guest would need to use a para-virtualized scheduler, so that multiple deadlines can be passed to the hypervisor, which would somehow pass that data on to the kernel. Ideally, the guest would pass the priority of its currently running thread to the hypervisor, so that it could properly schedule the process with the native OS; but it probably needs to register multiple priorities and deadlines, so that the native OS will escalate the priority of the hypervisor when it is time for its guest OS to run a high-priority task. Keeping the hypervisor and guest OS priorities synchronized will be a challenge, as the guest OS could finish its high-priority tasks early.
This seems to be a case where cgroups will work better than strict virtualization, once deadline scheduling is implemented there. cgroups have some of the same implementation issues as hypervisors, but they give the OS more information about thread-level scheduling.
Posted Dec 4, 2013 21:09 UTC (Wed)
by nevets (subscriber, #11875)
[Link]
Posted Dec 5, 2013 10:16 UTC (Thu)
by Lennie (subscriber, #49641)
[Link]
It's one kernel trying to communicate with another kernel, so the best way is to create some kind of para-virtualized API.
Best way to get any guarantees might actually be to use containers instead.
Posted Dec 5, 2013 11:02 UTC (Thu)
by hasta2003 (subscriber, #76829)
[Link]
I'd just like to add that, for anyone who wants to experiment with the new ABI, the "new-ABI" branch on https://github.com/jlelli/sched-deadline is the most up-to-date source.
As usual, any kind of feedback is highly appreciated (especially if it comes with a nice use-case :)).
Posted Dec 5, 2013 13:56 UTC (Thu)
by iq-0 (subscriber, #36655)
[Link] (6 responses)
The structure is passed as a pointer. If you consider the type of the structure that is pointed to to be dependent on the policy, then there should not be a problem. You only need the assistance of the policy logic to sensibly access the value.
And why is the size not the first parameter in the structure? A structure like this would be more sensible:
struct sched_attr {
    u32 size;
    u32 sched_flags;
    union {
        struct sched_params,
        struct deadline_params,
        struct foobar_params,
    }
}
If you need some big values in the structure for some future policy (change), then you don't have to work around some static field in the middle of the structure. And given a size, you're expected to always check it (and any possible flags) before interpreting the other fields.
Posted Dec 5, 2013 14:19 UTC (Thu)
by corbet (editor, #1)
[Link]
I can imagine a few reasons for not wanting to use a variable-sized structure with the old syscall. First of those is simple type safety; you'd lose the ability to check the type of the "params" argument in the compiler.

Beyond that, though, how do you make something like sched_getparam() safe? You might assume that an application will never get put into an unexpected scheduler class that would cause the overflow of the smaller structure size, but such assumptions have proved dangerous in the past.
Posted Dec 6, 2013 3:21 UTC (Fri)
by dashesy (guest, #74652)
[Link] (4 responses)
Yes, size should be the first to be of any value.

Providing the size is something very often seen in the Win32 API. One problem with providing the size, in general, is that sizes can only grow, so perhaps a version number would be more helpful. The other problem is that people start using sizeof() all over the place, and memory layout and packed attributes also take effect. Win32 code is usually not pretty.

But isn't it better not to predict the future too much? It will not hurt to add a new syscall in the future if needed, which comes down to politics I guess. Also, the norm is to add a number to the end of the old variant (e.g. dup and dup2, sched_setscheduler and sched_setscheduler2 and sched_setscheduler3), so that will fit better with the overall picture too.
Posted Dec 6, 2013 8:40 UTC (Fri)
by iq-0 (subscriber, #36655)
[Link]
I agree that a version would probably be better than a size (say something else changes but the structure remains the same size). And in that case you could possibly even drop the flags part (if you need it, change the version).

A size could still be useful, but that would be to make the generic syscall wrapper copy all the supplied data from userspace, to prevent concurrent updates of the userspace structure from confusing the kernel checks. But in that case I'd add it as a parameter of the syscall. It still wouldn't exclude a version in the structure, though.
Posted Dec 6, 2013 9:13 UTC (Fri)
by hasta2003 (subscriber, #76829)
[Link] (2 responses)
- version number is same as size: #define SCHED_ATTR_SIZE_VER0 40 /* sizeof first published struct */
- a new syscall, called sched_setscheduler2, that uses the new ABI, and that is extensible (just one, not a new sched_setschedulerX for every new scheduling class/policy we'll have in the future).
Posted Dec 6, 2013 11:58 UTC (Fri)
by iq-0 (subscriber, #36655)
[Link] (1 responses)
But moving the size to the front would still be more logical, or supplying it as an additional parameter outside the structure would be even better. Now you first have to check whether it's safe to read the size from userspace, and then you have to check whether it's safe to access the structure. This is effectively the same reason why you must always pass a socklen_t along with a sockaddr pointer.
But I really wonder: do you actually always want to receive all possible fields for each policy? At least specify that it is an error to supply a non-zero value for a field that is not supported by the given policy.
But it still feels like you're implementing something resembling what fcntl is for file descriptors, only for scheduling policies. Just try imagining fcntl taking a humongous structure describing all possible parameters for all its different uses. (I'm not saying you should mimic fcntl, but it is a syscall that caters to multiple distinct uses concerning a generic resource.)
Posted Dec 6, 2013 13:16 UTC (Fri)
by hasta2003 (subscriber, #76829)
[Link]
Checks about kernel to/from userspace reads/writes are done, and we ensure forward/backward compatibility.

Currently you are allowed not to specify a period, and it is considered equal to the deadline in this case. The other restriction is a non-null runtime. It is already specified in comments, but I should add it to the documentation too.
Posted Dec 5, 2013 21:07 UTC (Thu)
by lmb (subscriber, #39048)
[Link] (1 responses)
The currently used SCHED_RR is prone to priority inversion or even being pushed off by less crucial processes that just happen to have picked the same or higher scheduling priority. Not to mention that a bug in the code leads to CPU starvation. Deadline scheduling would allow us to put a constraint on this.
Posted Dec 6, 2013 9:23 UTC (Fri)
by hasta2003 (subscriber, #76829)
[Link]
Can you give more details? Like some application in particular that suffers from the problems you depicted. If I had the opportunity to look at the source code, I could probably figure out how we can use deadline scheduling for it.