Deadline scheduling: coming soon?
To recap briefly: deadline scheduling does away with the concept of process priorities that has been at the core of most CPU scheduler algorithms. Instead, each process provides three parameters to the scheduler: a "worst-case execution time" describing the maximum amount of CPU time needed to accomplish its task, a period describing how often the task must be performed, and a deadline specifying when the task's work must be completed. The actual scheduling algorithm is then relatively simple: the task whose deadline is closest runs first. If the scheduler takes care not to allow the creation of deadline tasks when the sum of the worst-case execution times would exceed the available CPU time, it can guarantee that every task will be able to finish by its deadline.
Deadline scheduling is thus useful for realtime tasks, where completion by a deadline is a key requirement. It is also applicable to periodic tasks like streaming media processing.
In recent times, work on deadline scheduling has been done by Juri Lelli. He has posted several versions, improving things along the way. His v8 posting in October generated a fair amount of discussion, including suggestions from scheduler maintainers Peter Zijlstra and Ingo Molnar that the time had come to merge this code. That merging did not happen for 3.13, but chances are that it will for a near-future kernel release, barring some sort of unexpected roadblock. The main thing that remains to be done is to get the user-space ABI nailed down, since that aspect is hard to change after it has been released in a mainline kernel.
Controlling the scheduler
To be able to guarantee that deadlines will be met, a deadline scheduler must have an incontestable claim to the CPU, so deadline tasks will run ahead of all other tasks — even those in the realtime scheduler classes. Deadline-scheduled processes cannot take all of the available CPU time, though; the amount of time actually available is controlled by a set of sysctl knobs found under /proc/sys/kernel/. The first two already exist in current kernels: sched_rt_runtime_us and sched_rt_period_us. The first specifies the amount of CPU time (in microseconds) available to realtime tasks, while the second gives the period over which that CPU time is available. By default, 95% of the total CPU time is made available to realtime processes, leaving 5% to give a desperate system administrator a chance to recover a system from a runaway realtime process.
The new sched_dl_runtime_us knob is used to say how much of the realtime allocation is available for use by the deadline scheduler. The default setting allocates 40% for deadline scheduling, but a system administrator may well want to tweak that value. Note that, while realtime scheduling is supported by control groups, deadline scheduling has not yet been implemented at that level. How deadline scheduling should interact with group scheduling raises some interesting questions that have not yet been fully answered.
The other piece of the ABI allows processes to enter and control the deadline scheduling regime. The current system call for changing a process's scheduling class is:
int sched_setscheduler(pid_t pid, int policy, const struct sched_param *param);
The sched_param structure used in this system call is quite simple:
struct sched_param {
    int sched_priority;
};
So sched_setscheduler() works fine for the currently available scheduling classes; the desired class is specified with the policy parameter, while the associated process priority goes into param. But struct sched_param clearly does not have the space needed to hold the three parameters needed with deadline scheduling, and its definition cannot be changed without breaking the existing ABI. So a new system call will be needed. As of this writing the details are still under discussion, but the new ABI can be expected to look something like this:
struct sched_attr {
    int sched_priority;
    unsigned int sched_flags;
    u64 sched_runtime;
    u64 sched_deadline;
    u64 sched_period;
    u32 size;
};

int sched_setscheduler2(pid_t pid, int policy, const struct sched_attr *param);
int sched_setattr(pid_t pid, const struct sched_attr *param);
int sched_getattr(pid_t pid, struct sched_attr *param, unsigned int size);
Where size (as both a parameter and a structure field) is the size of the sched_attr structure. If, in the future, the need arises to add more fields to that structure, the kernel will be able to use the size value to determine which version of the structure an application is using and respond accordingly. For the curious: size is meant to be specified within struct sched_attr when that structure is, itself, an input parameter to the kernel; otherwise size is given separately. The sched_flags field of struct sched_attr is not used in the current version of the patch.
One other noteworthy detail is that processes running in the new SCHED_DEADLINE class are not allowed to fork children in the same class. As with the realtime scheduling classes, this restriction can be worked around by setting the scheduling class to SCHED_DEADLINE|SCHED_RESET_ON_FORK, which causes the child to be placed back into the default scheduling class. Without that flag, a call to fork() will fail.
Time to merge?
The deadline scheduling patch set has a number of loose ends left to be dealt with, many of which are indicated in the patches themselves. But there comes a point where it is best to go ahead and get the code into the mainline so that said loose ends can be tied down more quickly; the deadline scheduling patches may well have reached that point. Since deadline scheduling can be added without much risk of regressions on systems where it is not in use, there should not be a whole lot more that needs to be dealt with before it can be merged.
...except, maybe, for one little thing. When deadline scheduling was discussed at the 2010 Kernel Summit, Linus and others clearly worried that there may not be actual users for this functionality. There has not been a whole lot of effort put into demonstrating users for deadline scheduling since then, though it is worth noting that the Yocto project has included the patch in some of its kernels. The JUNIPER project is also planning to use deadline scheduling, and has been supporting its development. Users like these will definitely help the deadline scheduler's case; Linus has become wary of adding features that may not actually be used. If that little point can be adequately addressed, we may have deadline scheduling in the mainline in the near future.
Posted Dec 4, 2013 19:35 UTC (Wed)
by bokr (subscriber, #58369)
[Link] (7 responses)
2. I am wondering if complex policy and critical-path stuff wouldn't best be done in userland at high realtime priority, i.e., let such a process do the figuring and policy-implementation planning to meet deadlines, and give it an ABI to tell the OS what to run at absolute priority starting when. Then it can yield its own CPU by sleeping a planned amount, as an inverted scheduling of other work.
Posted Dec 4, 2013 23:27 UTC (Wed)
by smurf (subscriber, #17840)
[Link] (6 responses)
If you have infrequent events which you must react to _right_now_, you definitely need "hard" real time, yet a deadline scheduler is useless.
Posted Dec 5, 2013 4:55 UTC (Thu)
by tseaver (guest, #1544)
[Link] (5 responses)
applicable to your problem, you are doing something else.

[1] http://en.wikipedia.org/wiki/Hard_realtime#Hard
Posted Dec 5, 2013 10:38 UTC (Thu)
by kugel (subscriber, #70540)
[Link]
Posted Dec 5, 2013 11:16 UTC (Thu)
by yaap (subscriber, #71398)
[Link] (3 responses)
Also, even for the first case EDF is not a requirement. One can use a preemptive scheduler with rate-monotonic analysis (see https://en.wikipedia.org/wiki/Rate-monotonic_scheduling) to make sure deadlines are met. And from memory it seems this has been done: provide an EDF-like library API that checks that the RMA conditions are met, but just allocate a priority and run on a preemptive scheduler.
So although hard RT is about deadlines and EDF has the proper keyword in its name, it is never a necessity. That may explain why there is little traction.
Posted Dec 5, 2013 13:03 UTC (Thu)
by hasta2003 (subscriber, #76829)
[Link] (2 responses)
- it can provide temporal guarantees, which plain EDF or RM cannot do
- it provides temporal isolation, so misbehaving tasks cannot affect your high-priority activities
- it can better utilize CPU power (optimal on UP: 100% of CPU power while still meeting deadlines, vs. ~69% for RM)
The RT literature is quite vast in this regard; more can be found in the documentation (https://lkml.org/lkml/2013/11/7/271). Just let me know if you want more details :).
Thanks,
- Juri
Posted Dec 6, 2013 1:51 UTC (Fri)
by bokr (subscriber, #58369)
[Link] (1 responses)
2. Deadlines to me fall into categories that show up more easily in the concrete real time scheduling problem of cooking and serving a dinner. One deadline is when everyone is seated and baked potato and grilled steaks and steamed veggies (with a cold pat of butter on top) all have to be served near simultaneously. This entails other deadlines and responses to events. Deadline for turning on oven. Hot enough to put in potatoes. Etc.
The general pattern is resource available, lax schedule for some prepping, cooking for a fixed interval with strict deadline at the end, not for starting, so the start has to be computed back, but it is important to start accurately so it will end accurately. The potatoes take the longest, the steaks probably parallel to the broccoli. You get the picture.
Also note that the oven will be hot, so dessert souffle cook should take that into account. Worst case cooking time is not uncorrelated unless you have uncorrelated start conditions.
3. I can imagine a situation where you are supplying services with different prices according to quality. The cheapest might schedule single processors and simple low bitrate algorithms, and not meeting deadlines is just breakups in the phone conversation or frozen frames instead of keeping up with jump-cut keyframes, or whatever.
The question is if the scheduling ABI will support building a server that does both high and low QOS processing at the same time, using more CPUs and power for the high paying customer.
In terms of GUI experience, one frame is a dinner, broccoli might be a video overlay of stock ticker updated one marquee shift, etc. But everything has to be cooked and ready/consistent at buffer flip time.
4. Can the EDF be tiered into separate priorities, so that high priority tasks run first, with maybe leftover time trickling to lower levels?
5. Priority is not to be used to control logical sequence order, but sometimes a ready queue that is guaranteed to run in logical fifo/roundrobin sequence is cheaper than guaranteeing access coordination by locks. Maybe just use locks and not worry about it?
Posted Dec 6, 2013 10:03 UTC (Fri)
by hasta2003 (subscriber, #76829)
[Link]
Let's say that in your restaurant you want to cook several dinners at the same time, and you want to assign them different priorities (friends and restaurant reviewers have to eat first, and no more than 10 minutes after they ordered); that's group scheduling: all activities for a single dinner receive higher priority. The same could apply to virtual machines on a shared host. It is a feature we are working on that will soon extend SCHED_DEADLINE's capabilities.
SCHED_DEADLINE already allows sporadic activations: you can decide that your task is activated by the completion of some other task, or when a timer fires (put the broccoli in the pot after the potatoes have been in the oven for 10 minutes). You just assign deadlines to them, and they will be scheduled among the other tasks, with no worries about relative priorities (as you probably have to do with FIFO/RR; e.g., are my broccoli more important than the steak?).
Then you may also want to assemble a cake layer by layer. Base in 1 minute, put cream on top (after the base) in 30 seconds, and pass it to the next stop where it will be covered by chocolate frosting in not more than 15 seconds. That sounds to me like a pipeline of tasks, and such pipelines have already been studied. Different approaches are possible (holistic view, intermediate deadlines, etc.), and all of them can make use of SCHED_DEADLINE.
Lastly, let's say you want to just say "prepare a vegetarian couscous in not more than 30 minutes" and you don't want to care about the parallel steps. Someone has to cook the couscous, someone else has to cut the onions, carrots, zucchini, etc., and then someone has to cook them. At the end, everything has to be put together again in a single dish and served to the customer. There is a lot of interest in this right now (OpenMP, etc.), and how to efficiently assign deadlines to parallel tasks is not yet decided, but the building blocks are already available using the new scheduling policy.
After all, you can also still use different policies for different activities, and we are actually working on how to correctly and safely make priority trickle down (proxy execution being a promising candidate).
Posted Dec 4, 2013 19:53 UTC (Wed)
by jhhaller (guest, #56103)
[Link] (2 responses)
A more complicated guest would need to use a para-virtualized scheduler, so that multiple deadlines can be passed to the hypervisor, which would somehow pass that data on to the kernel. Ideally, the guest would pass the priority of its currently running thread to the hypervisor, so that it could properly schedule the process with the native OS; but it probably needs to register multiple priorities and deadlines, so that the native OS will escalate the priority of the hypervisor when it is time for its guest OS to run a high-priority task. Keeping the hypervisor and guest OS priorities synchronized will be a challenge, as the guest OS could finish its high-priority tasks early.
This seems to be a case where cgroups will work better than strict virtualization, once deadline scheduling is implemented there. cgroups have some of the same implementation issues as hypervisors, but they give the OS more information about thread-level scheduling.
Posted Dec 4, 2013 21:09 UTC (Wed)
by nevets (subscriber, #11875)
[Link]
Posted Dec 5, 2013 10:16 UTC (Thu)
by Lennie (subscriber, #49641)
[Link]
It's one kernel trying to communicate with another kernel, so the best way is to create some kind of para-virtualized API.
Best way to get any guarantees might actually be to use containers instead.
Posted Dec 5, 2013 11:02 UTC (Thu)
by hasta2003 (subscriber, #76829)
[Link]
I'd just like to add that, for anyone who wants to experiment with the new ABI, the "new-ABI" branch on https://github.com/jlelli/sched-deadline is the most up-to-date source.
As usual, any kind of feedback is highly appreciated (especially if it comes with a nice use-case :)).
Posted Dec 5, 2013 13:56 UTC (Thu)
by iq-0 (subscriber, #36655)
[Link] (6 responses)
The structure is passed as a pointer. If you consider the type of the structure that is pointed to to be dependent on the policy, then there should not be a problem. You only need the assistance of the policy logic to sensibly access the value.
And why is the size not the first parameter in the structure? A structure like this would be more sensible:
struct sched_attr {
    u32 size;
    u32 sched_flags;
    union {
        struct sched_params,
        struct deadline_params,
        struct foobar_params,
    }
}
If you need some big values in the structure for some future policy (change), then you don't have to work around some static field in the middle of the structure. And given a size, you're expected to always check it (and any possible flags) before interpreting the other fields.
Posted Dec 5, 2013 14:19 UTC (Thu)
by corbet (editor, #1)
[Link]
I can imagine a few reasons for not wanting to use a variable-sized structure with the old syscall. First of those is simple type safety; you'd lose the ability to check the type of the "params" argument in the compiler.

Beyond that, though, how do you make something like sched_getparam() safe? You might assume that an application will never get put into an unexpected scheduler class that would cause the overflow of the smaller structure size, but such assumptions have proved dangerous in the past.
Posted Dec 6, 2013 3:21 UTC (Fri)
by dashesy (guest, #74652)
[Link] (4 responses)
Yes, size should be the first to be of any value.

Providing the size is something very often seen in the Win32 API. One problem with providing the size, in general, is that sizes can only grow, so perhaps a version number would be more helpful. The other problem is that people start using sizeof() all over the place, and memory layout and packed attributes also take effect. Win32 code is usually not pretty.

But isn't it better not to predict the future too much? It will not hurt to add a new syscall in the future if needed, which comes down to politics I guess. Also, the norm is to add a number to the end of the old variant (e.g. dup and dup2, sched_setscheduler and sched_setscheduler2 and sched_setscheduler3), so that will fit better with the overall picture too.
Posted Dec 6, 2013 8:40 UTC (Fri)
by iq-0 (subscriber, #36655)
[Link]
I agree that a version would probably be better than a size (say something else changes but the structure remains the same size). And in that case you could possibly even drop the flags part (if you need it, change the version).

A size could still be useful, but that would be to make the generic syscall wrapper copy all the supplied data from userspace, to prevent concurrent updates of the userspace structure from confusing the kernel checks. But in that case I'd add it as a parameter of the syscall. It still wouldn't exclude a version in the structure, though.
Posted Dec 6, 2013 9:13 UTC (Fri)
by hasta2003 (subscriber, #76829)
[Link] (2 responses)
- version number is same as size: #define SCHED_ATTR_SIZE_VER0 40 /* sizeof first published struct */
- a new syscall, called sched_setscheduler2, that uses the new ABI, and that is extensible (just one, not a new sched_setschedulerX for every new scheduling class/policy we'll have in the future).
Posted Dec 6, 2013 11:58 UTC (Fri)
by iq-0 (subscriber, #36655)
[Link] (1 responses)
But moving the size to the front would still be more logical, or supplying it as an additional parameter outside the structure would be even better. Now you first have to check whether it's safe to read the size from userspace, and then you have to check whether it's safe to access the structure. This is effectively the same reason why you must always pass a socklen_t along with a sockaddr pointer.
But I really wonder: do you actually always want to receive all possible fields for each policy? At least specify that it is an error to supply a non-zero value for a field that is not supported by the given policy.
But it still feels like you're implementing something resembling what fcntl is for file descriptors, only for scheduling policies. Just try imagining fcntl taking a humongous structure describing all possible parameters for all its different uses. (I'm not saying you should mimic fcntl, but it is a syscall that caters to multiple distinct uses concerning a generic resource.)
Posted Dec 6, 2013 13:16 UTC (Fri)
by hasta2003 (subscriber, #76829)
[Link]
Checks about kernel to/from userspace reads/writes are done, and we ensure forward/backward compatibility.

Currently you are allowed not to specify a period, and it is considered equal to the deadline in this case. The other restriction is a non-null runtime. It is already specified in comments, but I should add it to the documentation too.
Posted Dec 5, 2013 21:07 UTC (Thu)
by lmb (subscriber, #39048)
[Link] (1 responses)
The currently used SCHED_RR is prone to priority inversion or even being pushed off by less crucial processes that just happen to have picked the same or higher scheduling priority. Not to mention that a bug in the code leads to CPU starvation. Deadline scheduling would allow us to put a constraint on this.
Posted Dec 6, 2013 9:23 UTC (Fri)
by hasta2003 (subscriber, #76829)
[Link]
Can you give more details? Like some application in particular that suffers from the problems you depicted. If I had the opportunity to look at the source code, I could probably figure out how we can use deadline scheduling for it.