The unified control group hierarchy in 3.16
At its core, the control group subsystem is simply a way of organizing processes into hierarchies; controllers can then be applied to the hierarchies to enforce policies on the processes contained therein. From the beginning, control groups have allowed the creation of multiple hierarchies, each of which can contain a different mix of processes. So one could, for example, create one hierarchy and attach the CPU scheduler controller to it. Another hierarchy could be created for the memory controller; it could contain the same processes, but with a different organization. That would allow memory usage policy to be applied to different groupings of the same processes.
This flexibility has a certain appeal, but it has its costs. It can be expensive for the kernel to keep track of all the controllers that apply to a given process. Controllers also cannot effectively cooperate with each other, since they may be operating on entirely different hierarchies. In some cases (memory and block I/O bandwidth control, for example), better cooperation is needed to effectively control resource use. And, in the end, there has been little real-world use of this feature. So the plan has long been to get rid of the multiple-hierarchy feature, though it has always been known that this change would take a long time to effect fully.
Work on the unified control group hierarchy has been underway for some time, with much of the preparatory work being merged into the 3.14 and 3.15 kernels. In 3.16, this feature will be available, but only to users who ask for it explicitly. To use the unified hierarchy, the new control group virtual filesystem should be mounted with a command like:
mount -t cgroup -o __DEVEL__sane_behavior cgroup <mount-point>
Obviously, the __DEVEL__sane_behavior option is not intended to be a permanent fixture. It may still be some time, though, before the unified hierarchy becomes available as a default feature.
It is worth noting that the older, multiple-hierarchy mode continues to work even if the unified hierarchy mode is used; it will be kept around for as long as it seems to be needed. The unified hierarchy can be instantiated alongside older hierarchies, but controllers cannot be shared between the unified hierarchy and any others. The care that has been taken in this area should allow users to experiment with the unified mode while avoiding changes that would break existing systems.
In current kernels, controllers are attached to control groups by specifying options to the mount command that creates the hierarchy. In the unified hierarchy world, instead, all controllers are attached to the root of the hierarchy. (Strictly speaking, that's not quite true: controllers attached to old-style hierarchies will not be available in the unified hierarchy, but that's a detail that can be ignored for now.) Controllers can be enabled for specific subtrees of the hierarchy, subject to a small set of rules. For the purposes of illustrating these rules, imagine a control group hierarchy in which groups A and B live directly under the root control group, while C and D are children of B.
Each control group in the hierarchy has (in its associated control directory) a file called cgroup.controllers that lists the controllers that can be enabled for children of that group. Another file, cgroup.subtree_control, lists the controllers that are actually enabled; writing to that file can turn controllers on or off. It is worth repeating that these files manage the controllers attached to the children of the group; in the unified hierarchy, a control group is thought of as delegating its resources to subgroups for management. There are some interesting implications resulting from this design.
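The relationship between the two files can be sketched in a few lines of Python. This is only a model, not kernel code: the function name is hypothetical, and the "+name" / "-name" token syntax is taken from the kernel's unified-hierarchy documentation rather than from the description above.

```python
def apply_subtree_control(available, enabled, request):
    """Model a write such as "+memory -cpu" to cgroup.subtree_control.

    available: controllers listed in the group's cgroup.controllers
    enabled:   controllers currently listed in its cgroup.subtree_control
    Returns the new set of enabled controllers; the inputs are not modified.
    """
    result = set(enabled)
    for token in request.split():
        sign, name = token[0], token[1:]
        if sign not in ("+", "-") or not name:
            raise ValueError(f"malformed token: {token!r}")
        if sign == "+":
            # Only controllers offered by cgroup.controllers can be turned on.
            if name not in available:
                raise ValueError(f"{name} is not available in this group")
            result.add(name)
        else:
            result.discard(name)
    return result
```

On a real system the equivalent operation would be a write of the same string to the group's cgroup.subtree_control file.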
One of those is that a control group must apply a controller to all of its children or none. If the memory controller is enabled in B's cgroup.subtree_control file, it will apply to both C and D; there is no way (from B's point of view) to apply the controller to only one of those subgroups. Further, a controller can only be enabled in a specific control group if it is enabled in that group's parent; a controller cannot be enabled in group C unless it is already enabled in group B. That suggests that all controllers that are actually meant to be used must be enabled in the root control group, at which point they will apply to the entire hierarchy. It is, however, possible to disable a controller at a lower level. So, if the CPU controller is enabled in the root, it can be disabled in group A, exempting all of A's descendant groups from CPU control.
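These rules can be modeled with a small Python sketch of the root/A/B/C/D hierarchy described earlier. The Group class and its method names are hypothetical and exist only for illustration; the model covers enabling controllers and leaves out disabling them.

```python
class Group:
    """Toy model of one control group; not a kernel interface."""

    def __init__(self, name, parent=None, attached=()):
        self.name = name
        self.parent = parent
        self.attached = set(attached)   # controllers attached at the root only
        self.subtree_control = set()    # models cgroup.subtree_control

    @property
    def controllers(self):
        """Models cgroup.controllers: what may be enabled for the children."""
        if self.parent is None:
            return set(self.attached)
        return set(self.parent.subtree_control)

    def enable(self, controller):
        # A controller can be enabled here only if the parent enabled it.
        if controller not in self.controllers:
            raise ValueError(f"{controller} is not available in {self.name}")
        self.subtree_control.add(controller)

    def controlled_by(self, controller):
        """True if the controller acts on this group (the parent enabled it)."""
        return self.parent is not None and controller in self.parent.subtree_control


root = Group("root", attached={"cpu", "memory"})
a = Group("A", root)
b = Group("B", root)
c = Group("C", b)
d = Group("D", b)

root.enable("memory")   # memory now applies to both A and B
b.enable("memory")      # and, through B, to both C and D; never to just one
```

In this model, calling b.enable("cpu") raises an error because the root has not enabled the CPU controller for its children.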
Another new rule is that the cgroup.subtree_control file can only be used to change the set of active controllers if the associated group contains no processes. So, for example, if group B has controllers enabled in its cgroup.subtree_control file, it cannot contain any processes; those processes must all be placed into group C or D. This rule prevents situations where processes in the parent control group are competing with those in the child groups — situations that current controllers handle inconsistently and, often, badly. The one exception to the "no processes" rule is the root control group.
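A minimal Python sketch of the "no processes" rule might look like the following. The class, the method names, and the choice of RuntimeError are all assumptions made for illustration; the kernel reports such failures through its own error codes.

```python
class Group:
    """Toy model of the "no processes" rule; not a kernel interface."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.procs = set()              # PIDs placed directly in this group
        self.subtree_control = set()    # controllers enabled for the children

    def enable(self, controller):
        # Only the root may enable controllers while holding processes.
        if self.parent is not None and self.procs:
            raise RuntimeError(f"{self.name} still contains processes")
        self.subtree_control.add(controller)

    def add_process(self, pid):
        # A non-root group with enabled controllers must stay empty.
        if self.parent is not None and self.subtree_control:
            raise RuntimeError(f"{self.name} has controllers enabled")
        self.procs.add(pid)


root = Group("root")
b = Group("B", root)
c = Group("C", b)

root.add_process(1)
root.enable("memory")   # allowed: the root group is the one exception

b.add_process(100)      # fine while B has no controllers enabled
b.procs.discard(100)    # the process has to move down to a child...
c.add_process(100)
b.enable("memory")      # ...before B can enable a controller
```

Once B has a controller enabled, a later b.add_process() call fails in this model, mirroring the requirement that such processes live in C or D.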
One other control file found in the unified hierarchy is called cgroup.populated; reading it will return a nonzero value if there are any processes in the group (or its descendants). By using poll() on this file, a process can be notified if a control group becomes completely empty; the process would presumably respond by cleaning up and removing the group. Current kernels, instead, create a helper process to provide the notification; this technique has been frowned on for years.
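A watcher along these lines could be sketched in Python as shown below. The function names are hypothetical, and the sketch assumes that the kernel signals a change to the file as a POLLPRI (exceptional) event and that re-reading the file from offset zero yields the current value; both assumptions would need to be checked against the running kernel.

```python
import os
import select


def read_populated(fd):
    """Return True if the file behind fd currently holds a nonzero value."""
    os.lseek(fd, 0, os.SEEK_SET)
    return int(os.read(fd, 32).strip() or b"0") != 0


def wait_until_empty(path):
    """Block until the cgroup.populated file at path reads as zero."""
    fd = os.open(path, os.O_RDONLY)
    try:
        poller = select.poll()
        # Assumption: changes are reported as POLLPRI, not as readability.
        poller.register(fd, select.POLLPRI)
        while read_populated(fd):
            poller.poll()   # sleep until the kernel reports a change
    finally:
        os.close(fd)
```

A management daemon would call wait_until_empty() on the cgroup.populated file in the group's directory and then clean up and remove the group with rmdir().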
The unified hierarchy will allow a privileged process to delegate access to control group functionality by changing the owner of the associated control files. But this delegation only works to an extent: an unprivileged process with access to the control files can create child control groups and move processes between groups, but it cannot change any controller settings. This policy is there partly to keep unprivileged processes from disrupting the system, but the intent is also to restrict access to the more advanced control knobs. These knobs are currently deemed to expose too much information about the kernel's internals, so there is a desire to avoid having applications depend on them.
All of this work has been extensively discussed for years, with most of the major users of control groups having had their say. So it should be suitable for most of the known uses today, but that is no substitute for actually seeing things work. The 3.16 kernel will provide an opportunity for interested users to try out the new mode and find out which problems remain; actual migration by users to the new scheme cannot be expected to happen for a few more development cycles at the earliest, though. But, at some point, the control group rework will cease being something that's mostly talked about and become just another big job that eventually got done.
The unified control group hierarchy in 3.16
Posted Jun 12, 2014 6:25 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) (2 responses)
I can live with that, it's certainly better than The One Daemon To Rule Them All approach that systemd loves and wants.
Now, ability to grant permissions to move processes between cgroups to unprivileged users is baffling. It's not really of much use at all, without corresponding ability to change knobs. I understand that developers are hesitant to allow manipulation of some settings, but perhaps they can divide settings into 'good' and 'bad' sets and allowing unrestricted access only to the 'good' set?
The unified control group hierarchy in 3.16
Posted Jun 12, 2014 16:17 UTC (Thu) by raven667 (subscriber, #5198) (1 response)
The unified control group hierarchy in 3.16
Posted Jun 12, 2014 16:31 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
> This hierarchy becomes private property of systemd. systemd will set
> it up. Systemd will maintain it. Systemd will rearrange it. Other
> software that wants to make use of cgroups can do so only through
> systemd's APIs. This single-writer logic is absolutely necessary
And systemd was all for it. See here as an example: https://lwn.net/Articles/555922/
There were similar (in spirit) messages from Tejun Heo.
So I guess somebody hit the cgroups developers hard enough to make them see the light and re-introduce a sane delegation mechanism.
And now cgrouproot can live in /proc?
Posted Jun 15, 2014 17:40 UTC (Sun) by alison (subscriber, #63752) (3 responses)
And now cgrouproot can live in /proc?
Posted Aug 4, 2014 14:55 UTC (Mon) by kloczek (guest, #6391) (2 responses)
But you know .. Linux is now mature OS so it cannot change suddenly UAPI (despite that in Documentation directory still you can find document listing why Linux does not need stable KAPI/UAPI).
Linux has some kind of schizophrenia. In procfs you can find even some old attempts to try maintain not only processes and threads but groups of processes as well like /proc/<PID>/task/* but who cares that current attempt to catch up something which is working more than decade in other OSes is breaking something existing.
Cgroups development started at 2007. Who cares that after 7 years still is useless on providing very basic functionalities?
Let's give the chance new kernel developers generation to contribute to growing Linux kernel entropy .. isn't it?
And now cgrouproot can live in /proc?
Posted Aug 4, 2014 16:49 UTC (Mon) by dlang (guest, #313)
that document says that the internal API of the kernel is not stable.
the User API to the kernel is very stable.
And now cgrouproot can live in /proc?
Posted Aug 5, 2014 23:39 UTC (Tue) by nix (subscriber, #2304)
You really don't know very much about Linux at this level, do you?
The unified control group hierarchy in 3.16
Posted Jun 16, 2014 22:24 UTC (Mon) by kleptog (subscriber, #1183) (1 response)
For example, having a hierarchy for the processes like systemd wants and a hierarchy for resources seems like it could work. And would satisfy more people I believe.
The unified control group hierarchy in 3.16
Posted Jun 17, 2014 8:03 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
Cpuacct is harmless - it's a read-only accounting tool and can be used without too much consideration for its overhead or cross-controller interactions.
Freezer is different - it's often necessary to stop multiple processes atomically and they very well might be on separate levels.
The unified control group hierarchy in 3.16
Posted Jun 17, 2014 9:22 UTC (Tue) by MrWim (subscriber, #47432)
This would be useful for things like make where you might want to by default avoid slowing your other applications when you pass -j20.
I would find it most useful for unit test runners where you want to be certain that you've killed all the processes that were started by the test when the test ends. Essentially it would be process groups that actually work.
How many years will take Linus&co to develop Solaris contractfs+project
Posted Aug 4, 2014 14:37 UTC (Mon) by kloczek (guest, #6391) (2 responses)
Why Linux still is suffering on NIH (Not Invented Here) syndrome?
Why something so simple like managing tasks and processes must be driven by yet-another-stupid-fs?
Why no one from Linux developers is able to sit down study existing implementation of solutions of some problems, after this develop on first step consistent base API with plan how to extend base functionalities, and after this stick to agreed/approved plan?
Why .. ?
Why .. ?
.
.
Ten years after developing DTrace on Solaris most of the time spend on SystemTap, LTT, LTTng and many other attempts can be put in garbage and now more people on Linux is using DTrace delivered by commercial company.
Why Linux developers are trying again and again repeating the same errors and expecting that at some point it will Work(tm)?
How many years will take Linus&co to develop Solaris contractfs+project
Posted Aug 4, 2014 16:48 UTC (Mon) by dlang (guest, #313)
> Ten years after developing DTrace on Solaris
Sun licensed DTrace in a way that is deliberately incompatible with the GPLv2 license of the Linux kernel. As a result, it can't legally be distributed for Linux.
So blame this one on Sun/Oracle not Linux developers.
How many years will take Linus&co to develop Solaris contractfs+project
Posted Aug 5, 2014 23:37 UTC (Tue) by nix (subscriber, #2304)
> now more people on Linux is using DTrace delivered by commercial company
Well, I'd be very interested to hear where you got this information from. I'm one of the DTrace for Linux developers, and, y'know, I don't have that information. Possibly my bosses have it, but if so they haven't told me. To be honest I have no idea how anyone could know this sort of thing without horrendously invasive spying on users, or wildly unreliable usage surveys which have as far as know not been conducted.
(But maybe you mean some other Linux DTrace developed by a commercial company? Or perhaps you mean not 'more people than use SystemTap / perf / something else' but rather 'more people than used to use it', which is trivially true if it is used by anyone at all, since it has not always existed.)