LWN: Comments on "Deadline scheduling: coming soon?"

Deadline scheduling: coming soon?

hasta2003 — Fri, 06 Dec 2013 13:16:12 +0000

Since this struct is only allowed to grow (and any change is meaningless since it already contains necessary and sufficient parameters to describe real-time tasks), I think size can be considered as a version number as well.
Checks about kernel to/from userspace reads/writes are done and we ensure fwd/bwd compatibility.
Currently you are allowed to not specify a period, and it is considered equal to the deadline in this case. The other restriction is a non-null runtime. It is already specified as comments, but I should add it to documentation too.

Deadline scheduling: coming soon?

iq-0 — Fri, 06 Dec 2013 11:58:08 +0000

Version number is not the same as size. Because it can not capture that the meaning changes without the size changing. But one could easily achieve that through using flags.

But moving the size to the front would still be more logical or supplying it as an additional parameter outside the structure would be even better. Now you first have to check if it's safe to read the size from userspace and then you have to check if it's safe to access the structure.
This is effectively the same reason why you must always pass a socklen_t along with a sockaddr pointer.

But I really wonders: Do you actually always want to receive all possible fields for each policy? At least specify that is an error to supply a non-zero value for a field that is not supported by the given policy.

But it still feels like you're implementing something resembling what fcntl is for filedescriptors only for scheduling policies. Only try imagining fcntl taking a humongous structure describing all possible parameters for all different uses (I'm not saying you should mimick fcntl but it is just a syscall that caters to multiple distinct uses concerning a generic resource).

Deadline scheduling: coming soon?

hasta2003 — Fri, 06 Dec 2013 10:03:52 +0000

Interesting questions, some have already got answers, some others are being studied in these days (in the RT literature). What you are talking about sounds to me like pipelines of tasks, group/hierarchical scheduling and gang scheduling. I'm not saying that this is already doable using the new mechanisms, but they are the base for everything.

Let's say that in your restaurant you want to cook several dinners at the same time, and you want to assign them a different priority (friends and restaurant reviewers have to eat first and at least 10 minutes after they ordered), that's group scheduling: all activities for a single dinner receive higher priority, same could apply to Virtual Machines on a shared host. A feature we are working on and will extend soon SCHED_DEADLINE capabilities.

SCHED_DEADLINE can already allow sporadic activations: you can decide your task is activated by the completion of some other task, or when a timer fires (put broccoli in the pot after potatoes have been in the oven for 10 minutes). You just assign a deadline for them, and they will be scheduled among other tasks, no worries about relative priorities (like you probably have to do with FIFO/RR, e.g., are my broccoli more important than the steak?).

Then you may also want to assemble a cake layer after layer. Base in 1 minute, put cream on top (after base) in 30 seconds, and pass it to the next stop where it will be covered by a chocolate frosting in not more than 15 sec. That sounds to me like a pipeline of tasks, and have already been studied. Different approaches are possible (holistic view, intermediate deadlines, etc.), all of them can make use of SCHED_DEADLINE.

Lastly, let's say you want to just say "prepare a vegetarian couscous in not more than 30 minutes" and you don't want to care about parallel steps. Someone has to cook the couscous, some other has to cut onions, carrots, zucchini, etc. and then someone has to cook them. At the end all have to be put together again in a single dish a be served to the customer. There is lot of interest in this right now (OpenMP, etc.) and how to efficiently assign deadlines to parallel tasks is not yet decided, but the building blocks are already available using the new scheduling policy.

After all, you can also still use different policies for different activities, and we are actually working on how to correctly and safely make priority trickle (proxy exec being a promising candidate).

Deadline scheduling: coming soon?

hasta2003 — Fri, 06 Dec 2013 09:23:42 +0000

Sounds interesting!
Can you give more details? Like some application in particular that suffers from the problems you depicted. If I'd have the opportunity to look at the source code I could probably figure out how we can use deadline scheduling for it.

Deadline scheduling: coming soon?

hasta2003 — Fri, 06 Dec 2013 09:13:55 +0000

We have both:

- version number is same as size: #define SCHED_ATTR_SIZE_VER0 40 /* sizeof first published struct */
- a new syscall, called sched_setscheduler2, that uses the new ABI, and that is extensible (just one, not a new sched_setschedulerX for every new scheduling class/policy we'll have in the future).

Deadline scheduling: coming soon?

iq-0 — Fri, 06 Dec 2013 08:40:17 +0000

That's almost treating the syscall interface as symbol versioning. Sure it helps if the old interface was too restrictive, but in this case they just want a system for allowing different parameters for different scheduling policies. And a new syscall for each scheduling policy is not really sensible. In that case just add a syscall per scheduling policy and don't try to be generic about it.

I agree that version would probably be better than size (say something else changes but the structure remains the same size). And in that case you could possibly even drop the flags part (if you need it, change the version).

A size could still be usefull, but that would be to make the generic syscall wrapper copy all the supplied data from userspace to prevent concurrent updates on the userspace structure from confusing the kernel checks. But in that case I'd add it as a parameter in the syscall. It still wouldn't exclude a version in the structure though.

Deadline scheduling: coming soon?

dashesy — Fri, 06 Dec 2013 03:21:24 +0000

Yes size should be the first to be of any value.
Providing the size is something very often seen in Win32 API. One problem with providing the size, in general, is that sizes can only grow so perhaps a version number would be more helpful. The other problem is people start using sizeof() all over places, and memory layout and packed attribute also take effect. Win32 code is usually not pretty.
But isn't it better not to predict future too much, it will not hurt to add a new syscal in the future if needed, which comes to politics I guess. Also the norm is to add a number to the end of the old variant (e.g. dup and dup2, sched_setscheduler and sched_setscheduler2 and sched_setscheduler3), so that will fit better with overall picture too.

Deadline scheduling: coming soon?

bokr — Fri, 06 Dec 2013 01:51:49 +0000

1. What if execution time ranges are not uncorrelated? I.e., what if scheduling a before b guarantees best time for b (e.g. warming cache), but there is little or no effect on a if b is scheduled first but b first gives random or worst time for b itself? Then there is an optimum sequence.

2. Deadlines to me fall into categories that show up more easily in the concrete real time scheduling problem of cooking and serving a dinner. One deadline is when everyone is seated and baked potato and grilled steaks and steamed veggies (with a cold pat of butter on top) all have to be served near simultaneously. This entails other deadlines and responses to events. Deadline for turning on oven. Hot enough to put in potatoes. Etc.

The general pattern is resource available, lax schedule for some prepping, cooking for a fixed interval with strict deadline at the end, not for starting, so the start has to be computed back, but it is important to start accurately so it will end accurately. The potatoes take the longest, the steaks probably parallel to the broccoli. You get the picture.

Also note that the oven will be hot, so dessert souffle cook should take that into account. Worst case cooking time is not uncorrelated unless you have uncorrelated start conditions.

3. I can imagine a situation where you are supplying services with different prices according to quality. The cheapest might schedule single processors and simple low bitrate algorithms, and not meeting deadlines is just breakups in the phone conversation or frozen frames instead of keeping up with jump-cut keyframes, or whatever.

The question is if the scheduling ABI will support building a server that does both high and low QOS processing at the same time, using more CPUs and power for the high paying customer.

In terms of GUI experience, one frame is a dinner, broccoli might be a video overlay of stock ticker updated one marquee shift, etc. But everything has to be cooked and ready/consistent at buffer flip time.

4. Can the EDF be tiered into separate priorities, so that high priority tasks run first, with maybe leftover time trickling to lower levels?

5. Priority is not to be used to control logical sequence order, but sometimes a ready queue that is guaranteed to run in logical fifo/roundrobin sequence is cheaper than to guarantee access coordination by locks. Maybe just use locks and not worry about it?

Deadline scheduling: coming soon?

lmb — Thu, 05 Dec 2013 21:07:28 +0000

This is actually quite interesting for HA clustering, or other code that uses network liveness checks. It is timing sensitive (meaning it needs to be run periodically with hard guarantees, otherwise other nodes will declare this one dead), but only needs a brief tick to process everything.

The currently used SCHED_RR is prone to priority inversion or even being pushed off by less crucial processes that just happen to have picked the same or higher scheduling priority. Not to mention that a bug in the code leads to CPU starvation. Deadline scheduling would allow us to put a constraint on this.

Old syscall

corbet — Thu, 05 Dec 2013 14:19:08 +0000

I can imagine a few reasons for not wanting to use a variable-sized structure with the old syscall. First of those is simple type safety; you'd lose the ability to check the type of the "params" argument in the compiler.

Beyond that, though, how do you make something like sched_getparam() safe? You might assume that an application will never get put into an unexpected scheduler class that would cause the overflow of the smaller structure size, but such assumptions have proved dangerous in the past.

Deadline scheduling: coming soon?

iq-0 — Thu, 05 Dec 2013 13:56:05 +0000

Why is the old syscall not sufficient?

The structure is passed as a pointer. If you consider the the type of the structure that is pointed to to be dependant on the policy, than there should not be a problem. You only need assistance of the policy logic to sensibly access the value.

And why is the size not the first parameter in the structure? A structure like this would be more sensible:

struct sched_attr {
u32 size;
u32 sched_flags;
union {
struct sched_params,
struct deadline_params,
struct foobar_params,
}
}

If you need some big values in the structure for some future policy (change) than you don't have to workaround some static field in the middle of the structure. And given a size you're expected to always check it (and any possible flags) before interpreting the other fields.

Deadline scheduling: coming soon?

hasta2003 — Thu, 05 Dec 2013 13:03:34 +0000

Two things on why EDF (actually EDF + Constant Bandwidth Server = SCHED_DEADLINE) is better than RM for Hard and Soft RT:

- it can better utilize CPU power (optimal on UP, 100% of CPU power while still meeting deadlines, RM ~ 69%)
- it can provide temporal guarantees, which plain EDF or RM cannot do
- it provides temporal isolation, misbehaving tasks cannot affect your high priority activities

The RT literature is quite vast in this regard, something more can be found in Documentation (https://lkml.org/lkml/2013/11/7/271). Just let me know if you want more details :).

Thanks,

- Juri

Deadline scheduling: coming soon?

yaap — Thu, 05 Dec 2013 11:16:05 +0000

Yes, hard real-time implies deadlines. But there is a difference between deadlines known in advance, and reactive deadlines where an unexpected asynchronous event must be processed within a given response time. Both are cases of hard real-time, where there is a clear deadline on some processing. EDF scheduling is a natural match for the first case, but a preemptive scheduler is fine for the second case.

Also, even for the first case EDF is not a requirement. One can use a preemptive scheduler with rate monotonic analysis (see https://en.wikipedia.org/wiki/Rate-monotonic_scheduling) to make sure deadlines are met. And from memory it seems this has been done: provide an EDF like library API that will check that RMA conditions are met, but just allocate a priority and run on a preemptive scheduler.

So although hard RT is about deadlines and EDF has the proper keyword in its name, it is never a necessity. That may explain why there is little traction.

Deadline scheduling: coming soon?

hasta2003 — Thu, 05 Dec 2013 11:02:06 +0000

Thanks Jonathan for this report!

I'd just like to add that, for anyone that wants to experiment with the new ABI, branch "new-ABI" on https://github.com/jlelli/sched-deadline is the most updated source.

As usual, any kind of feedback is highly appreciated (especially if it comes with a nice use-case :)).

Deadline scheduling: coming soon?

kugel — Thu, 05 Dec 2013 10:38:23 +0000

Right, row-latency response is not the same as hard realtime, though one often wants both.

Deadline scheduling and virtualization

Lennie — Thu, 05 Dec 2013 10:16:32 +0000

Virtualization has always been messy. There is no proper way to do it, it's pretty messy.

It's one kernel trying to communicate with an other kernel. So the best way is to create a some kind of para-virtualized API.

Best way to get any guarantees might actually be to use containers instead.

Deadline scheduling: coming soon?

tseaver — Thu, 05 Dec 2013 04:55:24 +0000

"Hard" realtime implies deadlines by definition[1]; if deadlines aren't
applicable to your problem, you are doing something else.

]1] http://en.wikipedia.org/wiki/Hard_realtime#Hard

Deadline scheduling: coming soon?

smurf — Wed, 04 Dec 2013 23:27:07 +0000

> 1. When does hard real-time not require deadlines?

If you have infrequent events which you must react to _right_now_, you definitely need "hard" real time, yet a deadline scheduler is useless.

Deadline scheduling and virtualization

nevets — Wed, 04 Dec 2013 21:09:33 +0000

Perhaps using OSv would be useful here. As OSv does not allow fork as well, but you get a virtualized OS that run a single process with all the access to the virtual hardware (think a virtualized DOS). Having several of these instances under a deadline scheduler will allow them to get a guaranteed amount of CPU.

Deadline scheduling and virtualization

jhhaller — Wed, 04 Dec 2013 19:53:42 +0000

An interesting point (to me, at least) would be how one could use this API to support deadline scheduling in a virtualized guest. While the simple approach would be to lock the guest to particular cores, and keep other work away from those cores, that prevents using those cores for low-priority work.

A more complicated guest would need to use a para-virtualized scheduler, so that multiple deadlines can be passed to the hypervisor, and somehow pass that data to the kernel. Ideally, the guest would pass the priority of it's currently running thread to the hypervisor, so it could properly schedule the process with the native OS, but probably needs to register multiple priorities and deadlines, so that the native OS will escalate the priority of the hypervisor so it will schedule the hypervisor when it's time for it's guest OS to run a high priority task. Keeping the hypervisor and guest OS priorities synchronized will be a challenge, as the guest OS could finish it's high priority tasks early.

This seems to be a case where cgroups will work better than strict virtualization, once deadline scheduling is implemented there, but cgroups have some of the same type of implementation issues as hypervisors, but give the OS more information about thread-level scheduling information.

Deadline scheduling: coming soon?

bokr — Wed, 04 Dec 2013 19:35:02 +0000

1. When does hard real-time not require deadlines?

2. I am wondering if complex policy and critical path stuff wouldn't best be done in userland at high realtime priority I.e., let such a process do the figuring and policy implementation planning to meet deadlines, and give it an ABI to tell the OS what to run in absolute priority starting when. Then it can yield its own CPU by sleeping a planned amount, as an inverted scheduling of other work.