Virtually mapped kernel stacks

By Jonathan Corbet
June 22, 2016

The kernel stack in Linux is arguably a weak point in the system's design: it is small enough that kernel developers must constantly be aware of what they put on the stack to avoid overflows. But those overflows happen anyway, even in the absence of an attacker who is trying to force the issue — and, as Jann Horn recently demonstrated, there are reasons why attackers might want to force a stack overflow. When an overflow does occur, the kernel is poorly placed to even detect the problem, much less act on it. The stack has changed little over the lifetime of the kernel, but some recent work has the potential to, finally, make the kernel stack much more robust.

How current kernels stack up

Each process has its own stack for use when it is running in the kernel; in current kernels, that stack is sized at either 8KB or (on 64-bit systems) 16KB of memory. The stack lives in directly-mapped kernel memory, so it must be physically contiguous. That requirement alone can be problematic since, as memory gets fragmented, finding two or four physically contiguous pages can become difficult. The use of directly mapped memory also rules out the use of guard pages — non-accessible pages that would trap an overflow of the stack — because adding a guard page would require wasting an actual page of memory.

As a result, there is no immediate indication if the kernel stack has overflowed. Instead, a stack that grows too large simply overwrites whatever memory is located just below the allocated range (below because stacks grow downward on most architectures). There are options to detect overflows by putting canaries on the stack, and development options can track stack usage. But if a stack overflow is detected at all on a production system, it is often well after the actual event and after an unknown amount of damage has been done.

For added fun, there is also a crucial data structure — the thread_info structure — placed at the bottom of the stack area. So if the kernel stack overflows, the thread_info, which provides access to almost everything the kernel knows about the running process, will be overwritten first. Needless to say, that makes stack overruns even more interesting to attackers; it's hard to know what will be placed below the stack in memory, but the thread_info structure is a known quantity.

It is not surprising, then, that kernel developers work hard to avoid stack overflows. On-stack allocations (usually in the form of automatic variables) are examined closely, and, as a general rule, recursion is not allowed. But surprises can come in a number of forms, from a careless variable declaration to unexpectedly deep call chains. The storage subsystem, where filesystems, storage technologies, and networking code can be stacked up to arbitrary depths, is particularly prone to such problems. This sort of surprise led to the expansion of the x86-64 kernel stack to 16KB for the 3.15 release, but there are limits to how big the kernel stack can be. Since there is one stack for every process in the system, any increase in its size is felt many times over.

The problem of avoiding stack overflows is likely to remain a challenge for kernel developers for some time, but it should be possible for the kernel to respond better when an overflow does happen. The key to doing so, as can be seen in Andy Lutomirski's virtually mapped stacks patch set, is to change how kernel stacks are allocated.

Virtually mapped stacks

Almost all memory that is directly accessed by the kernel is reached via addresses in the directly mapped range. That range is a large chunk of address space that is mapped to physical memory in a simple, linear fashion, so that, for all practical purposes, it looks as if the kernel is working with physical memory addresses. On 64-bit systems, all of memory is mapped in this way; 32-bit systems do not have the ability to fully map the amount of memory found in current systems, so more complicated games must be played.

Linux is a virtual-memory system, though, and so the kernel uses virtual addresses to reach memory, even in the directly mapped range. As it happens, the kernel reserves another range of addresses for virtually mapped memory; this range is used when memory is allocated with vmalloc() and is, consequently, called the "vmalloc range." Allocations in this range are pieced together a page at a time and are not physically contiguous. Traditionally, the use for this range is to obtain a relatively large chunk of memory that needs to be virtually contiguous, but which can be physically scattered.

There is (almost! — see below) no need for kernel stacks to be physically contiguous, so they could, in principle, be allocated as individual pages and mapped into the vmalloc area. Doing so would eliminate one of the biggest uses of larger (physically contiguous) allocations in the kernel, making the system more robust when memory is fragmented. It also would allow the placement of no-access guard pages around the allocated stacks without the associated memory waste (since all that is required is a page-table entry), allowing the kernel to know immediately if it ever overruns an allocated stack. Andy's patch does just this — it allocates kernel stacks from the vmalloc area. While he was at it, he added graceful handling of overflows; a proper, non-corrupt oops message is printed, and the overflowing process is killed.

The patch set itself is relatively simple, with most of the patches dealing with the obnoxious architecture-specific details needed to make it work. It seems like a significant improvement to the kernel, and the reviews have been generally positive. There are a few outstanding issues, though.

Inconvenient details

One of those is performance; allocating a stack from the vmalloc area, Andy says, makes creating a process with clone() take about 1.5µs longer. Some workloads are highly sensitive to process-creation overhead and would suffer with this change, so it is perhaps unsurprising that Linus responded by saying that "that problem needs to be fixed before this should be merged." Andy thinks that much of the cost could be fixed by making vmalloc() (which has never been seriously optimized for performance) faster; Linus, instead, suggests keeping a small, per-CPU cache of preallocated stacks. He has, in any case, made it clear that he wants the performance regression dealt with before the change can go in.

Another potential cost that has not yet been measured is an increase in translation misses. The directly mapped area uses huge-page mappings, so the entire kernel (all of its code, data, and stacks) can fit in a single translation lookaside buffer (TLB) entry. The vmalloc area, instead, creates another window into memory using single-page mappings. Since references to kernel stacks are common, the possibility of an increase in TLB misses is real if those stacks are reached via the vmalloc area.

One other important little detail is that, while allocations from the vmalloc area include guard pages, those pages are placed after the allocation. For normal heap memory, that is where overruns tend to happen. But stacks grow downward, so a stack overrun will overwrite memory ahead of the allocation instead. In practice, as long as a guard page is placed at the beginning of the vmalloc area, the current code will ensure that there are guard pages between each pair of allocations, so the pre-allocation page should be there. But, given that the guard pages are one of the primary goals of the patch set, some tweaks may be needed to be sure that they are always placed at the beginning of each stack.

Memory mapped into the vmalloc range has one specific constraint: it cannot easily be used for direct memory access (DMA) I/O. That is because such I/O expects a memory range to be physically contiguous, and because the virtual-to-physical mapping address functions do not expect addresses in that range. As long as no kernel code attempts to perform DMA from the stack this should not be a problem. DMA from the stack is problematic for other reasons, but it turns out that there is some code in the kernel that does it anyway. That code will have to be fixed before this patch can be widely used.

Finally, kernels with this patch set applied will detect an overflow of the kernel stack, but there is still the little problem of the thread_info structure living at the bottom of each stack. An overrun that overwrites only this structure, without overrunning the stack as a whole, will not be detected. The proper solution here is to move the thread_info structure away from the kernel stack entirely. The current patch set does not do that, but Andy has said that he intends to tackle that problem once these patches are accepted.

That acceptance seems likely once the current problems have been dealt with. Giving the kernel proper detection and handling of stack overruns will remove an important attack vector and simply make Linux systems more robust. It is hard to complain about changes like that.

Index entries for this article
Kernel	Kernel stack
Kernel	vmalloc()
Security	Linux kernel

Virtually mapped kernel stacks

Posted Jun 23, 2016 12:15 UTC (Thu) by spender (guest, #23067) [Link] (3 responses)

What a great idea! Why didn't I think of this years ago?

I guess this is another example of what happens when upstream "punishes" us by refusing to credit us while ripping off research, ideas, and implementations. When the upstream author fails to do so, articles like this repeat it (because according to the editorial strategy here, whatever is mentioned on public upstream mailing lists is the only truth).

Is it really so difficult for people to act decently instead of playing these silly games walking the edge of plagiarism to protect their ego?

Some reading material:
https://lwn.net/Articles/600821/
https://lkml.org/lkml/2016/6/15/1064
https://lkml.org/lkml/2016/6/21/826
http://www.openwall.com/lists/kernel-hardening/2016/06/23/1

So after all the pointless bikeshedding, we're back to my initial KSTACKOVERFLOW implementation, doing the exact same things on the same platform. Of course, as Andy has discovered (again by looking at code he didn't credit in his patches) this has no chance of working anytime soon, particularly when specific debugging options (SG_DEBUG, etc) are enabled. Prior to changing my implementation, I had fixed up dozens of these DMA on stack issues which had persisted in the kernel for years and had also been added in new staging code. This implementation will both need to fix all the things I fixed, as well as all the issues my new implementation automatically handles or it'll simply break people's machines.

-Brad

Virtually mapped kernel stacks

Posted Jun 23, 2016 14:19 UTC (Thu) by itvirta (guest, #49997) [Link]

Whatever happened to the "try to be polite and respectful" part of communicating with other people.

Virtually mapped kernel stacks

Posted Jun 25, 2016 1:40 UTC (Sat) by luto (subscriber, #39314) [Link]

I'm just a person, and I don't have any special ability to speak for "upstream". I certainly have no intention of punishing anyone or ripping off anyone's work. My patchset isn't based on grsecurity. I suspect it works quite differently from grsecurity's implementation, but I haven't actually looked at grsecurity's implementation in any great detail.

My understanding is that grsecurity has some special cases that make DMA on the stack continue working. I'd guess that this type of misuse in the kernel was much more widespread several years ago than it is now, since the DMA API currently has the ability to warn if the stack is used for DMA and people have put some effort into fixing the offenders.

I could mention in the commit message that GRKERNSEC_KSTACKOVERFLOW has had a similar feature for years.

Virtually mapped kernel stacks

Posted Jul 1, 2016 21:13 UTC (Fri) by Wol (subscriber, #4433) [Link]

> Is it really so difficult for people to act decently instead of playing these silly games walking the edge of plagiarism to protect their ego?

Darwin, meet Wallace. (Or is it the other way around?)

Cheers,
Wol

Virtually mapped kernel stacks - ongoing effort

Posted Jun 23, 2016 14:12 UTC (Thu) by ds2horner (subscriber, #13438) [Link]

http://article.gmane.org/gmane.linux.kernel/2251521

> When a task goes away, one reference is held until the next RCU grace
> period so that task_struct can be used under RCU (look for
> delayed_put_task_struct).

Yeah, that RCU batching will screw the cache idea.

But isn't it only the "task_struct" that needs that? That's a separate
allocation from the stack, which contains the "thread_info".

I think that what we *could* do is re-use the tread-info within the
RCU grace period, as long as we delay freeing the task_struct.

Yes, yes, we currently tie the task_struct and thread_info lifetimes
together very tightly, but that's a historical thing rather than a
requirement. We do the

account_kernel_stack(tsk->stack, -1);
arch_release_thread_info(tsk->stack);
free_thread_info(tsk->stack);

in free_task(), but I could imagine doing it earlier, and
independently of the RCU-delayed free.

Virtually mapped kernel stacks

Posted Jun 24, 2016 1:33 UTC (Fri) by samlh (subscriber, #56788) [Link]

Thank you for the interesting article, Jon! I find this kind of thing cool, but I quickly lose track of all the email threads, so I appreciate the work you put in to summarize what is happening.

Virtually mapped kernel stacks

Posted Jun 24, 2016 12:58 UTC (Fri) by ppisa (subscriber, #67307) [Link] (1 responses)

> The proper solution here is to move the thread_info structure away
> from the kernel stack entirely. The current patch set does not do that,
> but Andy has said that he intends to tackle that problem once these
> patches are accepted.

I think that breaking direct relation between thread_info and kernel task
stack (per thread - not per process as can be misinterpreted from article)
would be really bad decision. This way significant overhead would be caused for transforming running CPU state vector (SP in Linux case) to pointer to the thread_info pointer ("current").

I have in my mind immediately following solution when I have read about current kernel stack security affair week or so ago.

Use pagetable to protect stack for sure. Use 32kB aligned and 32 kB long virtual address range for each 16 kB stack (start address abbreviated sptr, alignment salig). Use page table entries next way

sptr + 0 * pagesize ... valid page ... thread_info
sptr + 1 * pagesize ... not present page ... guard against stack overflow
sptr + 2 * pagesize ... valid page ... top of the stack
sptr + 3 * pagesize ... valid page ... stack
...
sptr + n * pagesize ... valid page ... bottom of the stack
sptr + (n - 1) * pagesize ... not present page ... guard against buffer overflow (index out of local variable size)

Then "current " implementation can stay the same as it is now

current = current_sp & ~(salig - 1)

Stack size can be adjusted as required when n <= (salig / pagesize) - 3 is held.

There is small inefficiency that whole page is reserved for thread_info which can be some waste of the space but I expect that typical size of this structure is quite large today.

There is even another optimization possible for 64-bit architectures. They have abundant virtual space for today hardware (2^47 for AMD64 kernel) so it would be possible to reserve some large enough virtual range for kernel stacks allocation and place kernel stack on address which is directly computed from top level PID/task number. Then no vmalloc allocation is required to reserve unique virtual address area for new task stack. The size of the range is not so huge for today hardware. 64-bit linux kennel allows 4 millions tasks max. If 32 kB aligned stack slots are considered then 4 M * 32kB corresponds to 15 + 22 bits => 37 bits which is acceptable and mapping from PID to thread_info would be fast, actual tread state to current as well etc.

Virtually mapped kernel stacks

Posted Jun 25, 2016 1:18 UTC (Sat) by luto (subscriber, #39314) [Link]

There are only two serious things that live in thread_info and matter for performance: flags and cpu. Cpu would probably perform better in task_struct if it could share a cache line with the rest of the scheduler stuff. That leavs flags.

On new x86 kernels, most of the flags usage is in C. Moving it to task_struct will add one level of indirection. I'll benchmark it at some point, but I doubt it matters.

Anyway, as Linus pointed out, a better solution might be possible: stick flags for the running task in percpu memory instead of in per-task memory. This would make it faster than even the thread_info approach. Avoiding breaking signal handling while doing this could be interesting.

Glad to see it!

Posted Jun 27, 2016 1:02 UTC (Mon) by david.a.wheeler (guest, #72896) [Link]

I love to see general hardening solutions an important software like the Linux kernel. I'm very glad to see this.

Virtually mapped kernel stacks

Posted Jul 1, 2016 21:57 UTC (Fri) by nix (subscriber, #2304) [Link] (1 responses)

A doubtless stupid question from someone who never paid that much attention to this area of things: if the thread_info is below the stack and vulnerable to overruns, why not just move it to the top of the stack page and start the stack immediately below it? It's fixed-size, after all, so it should be easy to start the stack pointer right below it. The stack would then grow away from the thread_info, and overruns could not run into it (only underruns, which should be much rarer, one hopes).

Virtually mapped kernel stacks

Posted Jul 1, 2016 22:09 UTC (Fri) by corbet (editor, #1) [Link]

The placement at the bottom was initially done so that it could be easily located just by aligning the stack pointer. It's not done that way anymore, so that doesn't matter much. Moving thread_info to the top would make it harder to overwrite, but doesn't solve the other problems that come with having it in that bit of memory. If you're going to change things, it seems better to just move it out entirely.