Designing better kernel ABIs

By Jonathan Corbet
October 26, 2016

Michael Kerrisk started his 2016 Kernel Recipes talk by noting that the man pages collection, which he maintains, now documents over 2,200 interfaces at the kernel level and above; it adds up to about 2,500 printed pages. In the course of writing and collecting all this documentation, he has seen a lot of ABIs. There are many lessons to be learned from this experience, he said, if we want to create better ABIs in the future.

There is a whole set of obvious goals that one might want to meet when designing an ABI. It should, naturally, be bug-free and as simple as possible, while being extensible and maintainable. Compliance with existing standards is important as well. Presumably, he said, we all agree on these goals — but we repeatedly fail on all of them.

A history of failure

Failure comes in many forms. One of those, of course, is simple bugs. "Show me a new interface," Kerrisk said, "and I'll show you a bug." There is insufficient pre-release testing of new ABIs, meaning that too many bugs slip through. That, in turn, leads to pain for user space, which may need to carry special-case code to handle the bugs found in different kernel versions.

Interface inconsistencies find their way in at a number of levels, including at the design level; he pointed out that the kernel has about a half-dozen different architecture-dependent clone() interfaces. There are also behavioral inconsistencies that can create ongoing pain for application developers. Consider, for example, the mlock() system call:

    int mlock(const void *addr, size_t len);

When mlock() is called, the addr argument will be rounded down to the nearest page boundary, while the end of the range (addr+len) will be rounded up. Thus, calling mlock(4000, 6000) will affect memory addresses zero through 12287. Now consider remap_file_pages():

    int remap_file_pages(void *addr, size_t len, int prot,
                         size_t pgoff, int flags);

This system call rounds len downward, so a call to:

    remap_file_pages(4000, 6000, ...);

will affect the range from zero to 4095. That sort of surprising behavioral difference, he said, is an ongoing source of bugs. Another breeding ground is the set of system calls that change the attributes of another process; these include setpriority(), ioprio_set(), migrate_pages(), and more. Each of these must check the credentials of the caller and decide when an unprivileged caller will be able to carry out the requested action; each interface has a different set of rules.

What about maintainability? System calls should normally include a flags argument, but that has often not been the case. As a result, the number of system calls seems to grow without bound. umount() lacked a flags argument, so now we have umount2(). For similar reasons, the kernel offers preadv2(), epoll_create1(), renameat2(), and more. In some cases, the original interface was a historical legacy, but others "we did to ourselves." A related problem is a failure to check for unknown flags, as seen in sigaction(), recv(), clock_nanosleep(), msgrcv(), semget(), semop(), open(), and more. Among other things, failure to return an error on an unknown flag means that user space cannot check whether a given feature is supported or not.

In short, he said, decentralized design as seen in the kernel community has its advantages, but it fails badly if the goal is a coherent design. As an example, consider capabilities. When a new privileged feature is added to the kernel, the question arises: should a new capability be added to control access to it? Nobody wants to see an explosion in the number of capabilities, so it is generally deemed preferable to use an existing one. In practice, the existing one is almost invariably CAP_SYS_ADMIN, which is tested for in some 40% of all cases. It is, he said, "the new root", and the goal of finer-grained privilege checking has not been achieved. The first version of the control-group ABI was also plagued by inconsistencies.

In the end, he said, "we are just traditionalists" following and upholding a long history of Unix ABI mess-ups. The problem, of course, is that interface design is hard, and errors normally cannot be fixed without breaking existing user-space programs. So thousands of programs have to live with the consequences of our ABI mistakes for decades. We really need to get better at getting things right the first time.

Avoiding failure

How do we do that? A lot of it comes down to review and testing. Unlike some other parts of the kernel, ABI design does not really lend itself to mechanical testing, though. New interfaces simply need a lot of human review. That said, there is a place for unit tests; the kernel has been slow to adopt them, but that is beginning to change. Unit tests can detect regressions and unexpected changes, and help to ensure that a new interface lives up to what it is supposed to do.

Consider, for example, the recvmmsg() system call. Toward the end of the discussion before this call was merged, it gained a new timeout parameter. The expectation was that this timeout would apply to the call as a whole. In truth, it was only tested after the receipt of a datagram; until something shows up, the timeout has no effect at all. In other words, nobody bothered to test it and, as a result, it was useless.

Once tests are written, where should they go? The Linux Test Project is the traditional home for such tests, but it is not ideal. It is an out-of-tree test suite, and new tests only show up there after the ABI they test has appeared in an official release. Test coverage is partial; in the end, it simply does not solve the problem. The kernel's self-testing facility is a better place; importantly, it has a paid maintainer. Those interested in working with kselftest can find more information on the kernel.org wiki.

There is only so much value to be had from testing, though, in the absence of a specification for how a new interface is expected to behave. In the case of recvmmsg(), nobody ever wrote that specification, so it was not possible to write a test for it. There are many benefits to written specifications; they serve as a target for the implementation and help those who write tests. A specification allows reviewers to critique the interface independently of the implementation, and increases the number of reviewers overall. This specification generally belongs in the changelog of the patch adding the new interface, though an even better approach is to send a man-page patch.

The best thing to do, though, is to write a real application that uses the new interface. A while back he decided to delve into the inotify interface in order to improve its documentation. It is, in many ways, a good interface, much better than its predecessor. But it could have been better yet. At one point he thought he understood it, so he tried to write a real application that used it; the result was this article series, among other things.

That application required 1,500 lines of C code to get its job done. The inotify interface leaves a lot of work for the application to do. For example, change notifications lack user and process-ID information, making it impossible for a monitoring application to know who made a change. Directory monitoring is not recursive; if an application wants to watch a directory tree, it must set a separate watch on every directory in that tree. That, he said, may be unavoidable in the end.

A problem that was avoidable, instead, has to do with the renaming of files. A rename will generally result in two events; a "rename from" event and a "rename to" event. Unfortunately, these two events are not guaranteed to be consecutive in the event stream. In fact, they are not even guaranteed to both exist: if a file is renamed into a directory that is not monitored, the "rename to" event will not be generated. So an application has no definitive way to know if it will ever receive a "rename to" event or not; the result is a series of racy workarounds in user space. Life would have been much simpler if the two events had simply been guaranteed to show up together.

The lesson, he said, is that the only way to find nontrivial ABI problems is to write real applications using the interface — before that interface is released to the world as a whole.

Another way to improve our interfaces is to write documentation, of course. Describing what you're doing makes you think more deeply about it, he said. It also makes the new interface easier for others to understand, lowering the barriers to participation. A well-written man page is one way to do this documentation, but not the only way.

Discovery and feedback

An ongoing problem area is discovery — there is no simple way to find out when a particular kernel ABI has changed. He doesn't have the time to follow everything on the linux-kernel list, and neither does anybody else. The linux-api list exists and should receive copies of patches that change interfaces, but that often fails to happen. So he relies on some scripts of his own to find changes, but they are imperfect. Often, interface changes are discovered by sheer luck when he stumbles across them. On rare occasion he'll actually get a man page for a new interface.

He is far from the only person interested in interface changes. Application developers, C library developers, the strace maintainers, the Linux Test Project, and more all want to know about them. But user-space developers are typically the last to learn about changes — except in the unfortunate cases where even the kernel developers don't know that they changed something. Some changes to POSIX message queues in 3.5 broke the interface, for example. 2.6.12 featured an unexpected change to fcntl(F_SETOWN) semantics. Nobody noticed until much later, at which point it was too late to fix things, since other programs depended on the new behavior. That is how we end up with options like F_SETOWN_EX, added in 2.6.32 in an attempt to fix the problems created by that change.

That last example highlights a problem in the kernel's feedback loops. There are generally at least six months between when a new interface is added to the kernel and when users actually see it. In the worst case, design bugs will only be discovered when users start to look closely at this interface; by then, it is usually too late to fix them. We really need to get feedback sooner, before the kernel is committed to a specific interface.

How do we get that feedback? Kerrisk's suggestions should not be surprising at this point: write a specification for the new feature, and write example programs that use it as well. Copy the patches liberally to the relevant mailing lists. Write documentation, or write an article for LWN. Don't just target kernel developers; publicize the details of the new interface broadly. Some developers, he said, have done all of these things, and the result has been far better interfaces. He called out Jeff Layton and his open file description locks (formerly file-private locks) work as an example of how to do it right.

Is all of this overkill? Maybe, but it results in making a lot of people's lives easier. Especially his, he allowed, but not only his. By doing this work, developers can help to get more people involved in the process of looking at a new interface; that is necessary, since he alone does not scale well. The original developer has all of the information needed to judge a new interface; by getting it out there, they can bring more eyeballs to bear and have a much better chance of getting the interface right the first time.

Slides from and video of the talk are available to those wanting more information.

[Your editor thanks Kernel Recipes for assisting with his travel to the event.]

Index entries for this article
Kernel	Development model/User-space ABI
Conference	Kernel Recipes/2016

Designing better kernel ABIs

Posted Oct 27, 2016 3:12 UTC (Thu) by deater (subscriber, #11746) [Link] (1 responses)

I'm currently trying to document a newish feature in one complex interface (perf_event_open) that interacts with an even worse interface (eBPF). Fun times.

The feature is not documented well, the sample code in the kernel commit that introduced the feature doesn't actually work, and the relevant bpf manpage section just says "to be documented".

The only plausible option is to try to reverse engineer what the "perf" tool does. Take a look at the perf code sometime, large sections of it have not a single code comment, and have fun things where bpf_object_load() calls bpf_object__load_progs() calls bpf_program__load() which calls load_program() which calls bpf_load_program() and eventually you just give up.

Designing better kernel ABIs

Posted Oct 27, 2016 23:46 UTC (Thu) by zlynx (guest, #2285) [Link]

I've often seen that in Java programs where there's some rule that certain modules can only talk to certain limited other modules.

So the programmer, instead of making a direct call from Module A to Module D has to call B, which calls C, which finally calls D.

And that's the simple form. Once a Java "architect" gets involved and the code goes all "dependency inversion" you have the above problem plus XML files.

But it sure does look pretty and organized in a graph with little dotted boxes around the modules.

Designing better kernel ABIs

Posted Oct 27, 2016 7:54 UTC (Thu) by richiejp (guest, #111135) [Link] (2 responses)

The LTP has a paid maintainer as well. Several other people are also paid to work on it. Perhaps you are aware of that, but it read a bit like you were implying that the self tests are better because there is a paid maintainer.

Designing better kernel ABIs

Posted Oct 27, 2016 12:30 UTC (Thu) by mkerrisk (subscriber, #1978) [Link]

My intended comparison here wasn't with LTP, but rather between " having a paid maintainer" and "having no paid maintainer". When someone is paid to do the task, you are mostly going to get better results than when it's squeezed in as someone's spare-time volunteer activity.

Designing better kernel ABIs

Posted Nov 7, 2016 8:41 UTC (Mon) by metan (subscriber, #74107) [Link]

I got the same feeling from the text as well, so I would say that the wording is not ideal. And I would say that LTP has a bit more manpower than selftest at this point, long gone are the days I've been fighting failing testcases all alone in the dark.

However I would like to say that there is no competition as well, we have the same goals, etc. But the focus is a bit different, hence we cannot just take LTP testscases and merge them into the kernel tree (as it was proposed several times). LTP is more about testing the stability of the system as a whole, there are stress test that take hours and we are also trying to be backward compatible so that latest LTP can run on currently supported enterprise distributions as well. The selftest, as I see it, is more of a quick unit test for the newly introduced functionality.

What I would love to do, on the other hand, is to unify the test API for both projects. We have new and quite nice test library that really simplifies test writing, and it has been tested in LTP for about half a year now. So maybe it's time to try to take the interesting parts and reuse them in selftest as well. And I really should write an article for LWN.net about the "driver model" for testcases we have in LTP now.

And lastly but not least, there are other problems with broken API as well. For instance the readahead() call shortens the count argument silently if kernel thinks that it is too large and returns success (zero). It can only fail if the file descriptor is not valid or readahead() for the particular fd is not implemented. It would be much easier to write testcases for it if the call returned how much was actually read ahead instead.

Designing better kernel ABIs

Posted Oct 27, 2016 12:38 UTC (Thu) by keroami (guest, #6921) [Link] (1 responses)

I am triggered by the word 'overkill'. By reordering work, it is LESS work.

1) The crux is in SOONER feedback. The work (writing progs, docs) will happen eventually (i.e. the feedback will happen eventually), so why not have it done sooner, rather than later?

Personally, I will forget all intricate details of what I am doing, so I am in a bad shape to receive feedback after a rather short amount of time (as short as a few days if I'm doing other intricate work).

2) Instead of writing all patch first and then all documentation, etc, try writing a little bit of functionality and the little bit of matching documentation. Same amount of work, but in a different order. Moreover, first write a small program that will use your unwritten bit of functionality. This gives you local feedback already.

3) Corollary of (2) It is easier for others to help you based on a small change, rather than a large change. That feeds back into (1).

4) Processing feedback should result in improvements. Improvements will save work in writing later functionality.

Designing better kernel ABIs

Posted Oct 27, 2016 13:05 UTC (Thu) by mkerrisk (subscriber, #1978) [Link]

Of course, I couldn't agree more.

Thanks for spelling out the conclusion that I wanted everyone to draw from this presentation ;-)

Designing better kernel ABIs

Posted Oct 27, 2016 13:09 UTC (Thu) by bandrami (guest, #94229) [Link] (4 responses)

Wait... why don't people want an explosion of capabilities? I thought that was the whole point -- shouldn't ideally every privileged call have its associated capability?

Designing better kernel ABIs

Posted Oct 27, 2016 13:17 UTC (Thu) by mkerrisk (subscriber, #1978) [Link] (3 responses)

Because there are something 1500 different capability checks in the kernel. We need to have silos of similar operations, rather than one capability for each check. Plus, the capability masks attached to files are currently 64 bits in size, of which 38 are so far used. Plus, having vast numbers of capabilities would make the life of sysadmins painful, in terms of trying to have a manageable grasp on what is going on on their system.

Designing better kernel ABIs

Posted Oct 27, 2016 13:27 UTC (Thu) by bandrami (guest, #94229) [Link] (2 responses)

Huh. From a design point I would assume making those silos is what my job as a sysadmin should be (though then again I generally disable capabilities and acls because they make security reasoning more complex than I like). I get the bitmask limit (I had thought it was 512 for some reason), but when I do use capabilities I always get worried about SYS_ADMIN and SETPCAP because from what I can see I might as well just be granting that process root at that point.

Designing better kernel ABIs

Posted Oct 27, 2016 15:46 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (1 responses)

> ... when I do use capabilities I always get worried about SYS_ADMIN and SETPCAP because from what I can see I might as well just be granting that process root at that point.

While fine access control granularity is generally a good thing, a number of capabilities are effectively equivalent to root or can be leveraged to obtain root given typical system configurations[1]. I see no reason why these should not be covered by a single CAP_SYS_ADMIN bit, reserving the remaining bits for those capabilities which can actually be isolated and contained.

[1] https://forums.grsecurity.net/viewtopic.php?f=7&t=2522 (Summary: 19 of the 35 capabilities then present were considered equivalent to being granted root access.)

Designing better kernel ABIs

Posted Oct 28, 2016 4:39 UTC (Fri) by mkerrisk (subscriber, #1978) [Link]

>> ... when I do use capabilities I always get worried
>> about SYS_ADMIN and SETPCAP because from what
>> I can see I might as well just be granting that
>> process root at that point.

(Yes, this issue one that I wrote about in an article linked to from the current article.)

> While fine access control granularity is generally a
> good thing, a number of capabilities are effectively equivalent
> to root or can be leveraged to obtain root given typical system
> configurations[1].

Yes, some capabilities can be leveraged to full root, but that doesn't necessarily make the scheme useless: the attacker still has to be able to execute the pathway that leverages to full root, so capabilities at least made the attacker's job harder.

> I see no reason why these should not be
> covered by a single CAP_SYS_ADMIN bit, reserving the
> remaining bits for those capabilities which can actually be
> isolated and contained.

The fundamental problem here is that by expanding CAP_SYS_ADMIN we exacerbate the existing problem that that capability really is as good (from an attacker's point of view) as traditional root. In fact, looking at the capabilities(7) man page to the (very partial) list of features enabled by CAP_SYS_ADMIN, there's a good argument that some of those could, and should, have been isolated out into some other silo, possibbly a new silo (capability) or a one of the existing silos. Here's a few cases that seem obvious to me:

              * perform IPC_SET and IPC_RMID operations on arbitrary Sys‐
                tem V IPC objects;
              * use ioprio_set(2) to assign IOPRIO_CLASS_RT  and  (before
                Linux 2.6.25) IOPRIO_CLASS_IDLE I/O scheduling classes;
              * employ CLONE_* flags  that  create  new  namespaces  with
                clone(2)  and  unshare(2) (but, since Linux 3.8, creating
                user namespaces does not require any capability);
              * call perf_event_open(2);
              * access privileged perf event information;
              * call  setns(2)  (requires  CAP_SYS_ADMIN  in  the  target
                namespace);

Designing better kernel ABIs

Posted Oct 27, 2016 17:16 UTC (Thu) by felixfix (subscriber, #242) [Link] (5 responses)

Mark new ABIs as beta for one year after introduction by tacking "_beta" onto the end of the syscall name. Fix bugs and stuff as usual, but also feel free to change the ABI spec itself from kernel to kernel, maybe only in major releases (4.1 to 4.2 ok, 4.1.1 to 4.1.2 no).

At the end of that beta period, retain the beta syscall and whatever spec badness is necessary, but fix as much as possible in the non-beta syscall.

One year later, remove the beta syscall.

There would be plenty of howls when the beta syscall did not get the fixes which went into the stable syscall, and probably more when the beta syscall was removed. But you'd be far better off down the road.

Designing better kernel ABIs

Posted Oct 28, 2016 4:48 UTC (Fri) by mkerrisk (subscriber, #1978) [Link] (3 responses)

I don't think such schemes would really work. There are many issues there. Many people would treat a Beta marking as "I won't bother even touching this". Others would actually start to depend on the API, and would howl, and exert pressure not to remove the Beta API, and in some cases they might even be successful (and they might even be correct to do so).

In any case, we've effectively done this sort of thing already. There have been cases where _freshly_ released APIs gor removed or changed a kernel release or two later, because it was (correctly) believed to that there would not be many (or, probably, any) users yet. The original timerfd() system call (later made into three system calls) and the paccept() API (later accept4()) are some such cases I recall, having had a hand in the changes. So, we've informally done this sort of thing already, but I don't think it would actually improve matters to formalize the process.

Designing better kernel ABIs - beta release

Posted Oct 29, 2016 23:00 UTC (Sat) by giraffedata (guest, #1954) [Link]

I think having a formal beta test of interface design, by initially naming a system call "xxx_beta" would work. People would use it for the same reason people always participate in beta tests, and even beg to do so: they want the function, and they're willing to pay the price of encountering bugs and having to change their use of it later. Of course, many others would regard it as "don't touch," and that is the point. We don't have to worry about hurting the people who didn't sign up for the risk.

I agree there would always be pressure not to change the beta interface. That is much like where someone designs a product to a draft standard and then argues the draft can't be changed because the product would then not comply. Sometimes they're successful; sometimes they aren't.

The decision to fix a new interface, incompatibly, is always painful. Are there really not many users of it yet? And is it OK to throw those few users under the bus? That decision is much easier when the function has "_beta" in its name and any user is definitely going to have to change his code, if only to call it by its non-beta name, no matter what.

Designing better kernel ABIs

Posted Oct 30, 2016 7:07 UTC (Sun) by dirtyepic (guest, #30178) [Link] (1 responses)

> In any case, we've effectively done this sort of thing already. There have been cases where
> _freshly_ released APIs gor removed or changed a kernel release or two later, because it
> was (correctly) believed to that there would not be many (or, probably, any) users yet.

Well as you say, these problems likely won't surface until someone tries to make use of the API. So, yes, you can change the interface if there aren't significant users but until there's significant users you won't know you need to change the interface.

I don't think adding beta to the call name is a good idea. Just state flat out that APIs are allowed to change for a short time after being introduced, until they are field tested and deemed stable, and that's just something people have to live with. I can't believe that forcing every future user of a system call to have to implement kludgy workarounds for broken behavior or poorly thought out interfaces is preferable to breaking a few things early in its lifetime, especially if that breakage results in an API that is easier to use, more functional, and consistent than it would be otherwise.

Designing better kernel ABIs

Posted Oct 30, 2016 14:58 UTC (Sun) by felixfix (subscriber, #242) [Link]

You say "Just state flat out that APIs are allowed to change for a short time after being introduced." The problem is knowing what "a short time after being introduced" is. If someone doesn't follow kernel changes much, they may have no idea how new or old an API is. That's one point of tacking "_beta" on the end -- it tells people right up front that this is new.

I also don't see much of a drawback to editing programs to remove the "_beta" a year later. Any program which never needs to be edited again is probably not being used much, so it's just a tiny edit if the API hasn't changed, or has only changed in trivial ways. If the beta interface was found lacking and needed major changes, then the beta callers need to revisit the API anyway.

Designing better kernel ABIs

Posted Nov 5, 2016 0:24 UTC (Sat) by Zolko (guest, #99166) [Link]

Mark new ABIs as beta for one year

wasn't this the whole point of the stable and development branches in the kernel naming ? Stable meaning "stable ABI" and development meaning "changing ABI" ?

Designing better kernel ABIs

Posted Oct 28, 2016 14:34 UTC (Fri) by jlayton (subscriber, #31672) [Link]

All of that is good advice, but I think one of the best things you can do when designing a new interface is to simply attempt to eliminate as much ambiguity in the interface as possible.

Are there multiple ways that an argument or field in a struct can be interpreted? Be very specific in how the interface will interpret it. Be specific about what errors will be returned, and under what conditions. All of that makes for a more tidy interface that is less apt to have problems later.

Designing better kernel ABIs

Posted Oct 28, 2016 18:49 UTC (Fri) by kpfleming (subscriber, #23250) [Link]

As I read this, I kept thinking that this article was asking for Test-Driven Development. Like all methodologies it's not a panacea, but for fixing the types of problems outlined (incompletely/inaccurately specified interfaces, inability to know whether the implementation follows the specification, etc.) it's a great tool.