
Transparent huge pages in 2.6.38

By Jonathan Corbet
January 19, 2011
The memory management unit in almost any contemporary processor can handle multiple page sizes, but the Linux kernel almost always restricts itself to just the smallest of those sizes - 4096 bytes on most architectures. Pages which are larger than that minimum - collectively called "huge pages" - can offer better performance for some workloads, but that performance benefit has gone mostly unexploited on Linux. That may change in 2.6.38, though, with the merging of the transparent huge page feature.

Huge pages can improve performance through reduced page faults (a single fault brings in a large chunk of memory at once) and by reducing the cost of virtual to physical address translation (fewer levels of page tables must be traversed to get to the physical address). But the real advantage comes from avoiding translations altogether. If the processor must translate a virtual address, it must go through as many as four levels of page tables, each of which has a good chance of being cache-cold, and, thus, slow. For this reason, processors maintain a "translation lookaside buffer" (TLB) to cache the results of translations. The TLB is often quite small; running cpuid on your editor's aging desktop machine yields:

   cache and TLB information (2):
      0xb1: instruction TLB: 2M/4M, 4-way, 4/8 entries
      0xb0: instruction TLB: 4K, 4-way, 128 entries
      0x05: data TLB: 4M pages, 4-way, 32 entries

So there is room for 128 instruction translations, and 32 data translations. Such a small cache is easily overrun, forcing the CPU to perform large numbers of address translations. A single 2MB huge page requires a single TLB entry; the same memory, in 4KB pages, would need 512 TLB entries. Given that, it's not surprising that the use of huge pages can make programs run faster.

The main kernel address space is mapped with huge pages, reducing TLB pressure from kernel code. The only way for user-space to take advantage of huge pages in current kernels, though, is through the hugetlbfs, which was extensively documented here in early 2010. Using hugetlbfs requires significant work from both application developers and system administrators; huge pages must be set aside at boot time, and applications must map them explicitly. The process is fiddly enough that use of hugetlbfs is restricted to those who really care and who have the time to mess with it. Hugetlbfs is often seen as a feature for large, proprietary database management systems and little else.

There would be real value in a mechanism which would make the use of huge pages easy, preferably requiring no development or administrative attention at all. That is the goal of the transparent huge pages (THP) patch, which was written by Andrea Arcangeli and merged for 2.6.38. In short, THP tries to make huge pages "just happen" in situations where they would be useful.

Current Linux kernels assume that all pages found within a given virtual memory area (VMA) will be the same size. To make THP work, Andrea had to start by getting rid of that assumption; thus, much of the initial part of the patch series is dedicated to enabling mixed page sizes within a VMA. Then the patch modifies the page fault handler in a simple way: when a fault happens, the kernel will attempt to allocate a huge page to satisfy it. Should the allocation succeed, the huge page will be filled, any existing small pages in the new page's address range will be released, and the huge page will be inserted into the VMA. If no huge pages are available, the kernel falls back to small pages and the application never knows the difference.
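The modified fault path described above can be sketched roughly as follows. This is illustrative C-style pseudocode, not the actual kernel symbols; the real implementation is considerably more involved:

```c
/* Hypothetical sketch of the THP page-fault logic; the function
 * names here are invented for illustration. */
page = alloc_huge_page();            /* try a 2MB allocation first    */
if (page) {
        fill_huge_page(page);        /* populate the whole 2MB range  */
        release_small_pages(vma, addr); /* drop any existing 4KB pages */
        map_huge_page(vma, addr, page); /* insert into the VMA        */
} else {
        fault_in_small_page(vma, addr); /* fall back transparently    */
}
```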

This scheme will increase the use of huge pages transparently, but it does not yet solve the whole problem. Huge pages must be swappable, lest the system run out of memory in a hurry. Rather than complicate the swapping code with an understanding of huge pages, Andrea simply splits a huge page back into its component small pages if that page needs to be reclaimed. Many other operations (mprotect(), mlock(), ...) will also result in the splitting of a page.

The allocation of huge pages depends on the availability of large, physically-contiguous chunks of memory - something which Linux kernel programmers can never count on. It is to be expected that those pages will become available at inconvenient times - just after a process has faulted in a number of small pages, for example. The THP patch tries to improve this situation through the addition of a "khugepaged" kernel thread. That thread will occasionally attempt to allocate a huge page; if it succeeds, it will scan through memory looking for a place where that huge page can be substituted for a bunch of smaller pages. Thus, available huge pages should be quickly placed into service, maximizing the use of huge pages in the system as a whole.

The current patch only works with anonymous pages; the work to integrate huge pages with the page cache has not yet been done. It also only handles one huge page size (2MB). Even so, some useful performance improvements can be seen. Mel Gorman ran some benchmarks showing improvements of up to 10% or so in some situations. In general, the results were not as good as could be obtained with hugetlbfs, but THP is much more likely to actually be used.

No application changes need to be made to take advantage of THP, but interested application developers can try to optimize their use of it. A call to madvise() with the MADV_HUGEPAGE flag will mark a memory range as being especially suited to huge pages, while MADV_NOHUGEPAGE will suggest that huge pages are better used elsewhere. For applications that want to use huge pages, use of posix_memalign() can help to ensure that large allocations are aligned to huge page (2MB) boundaries.

System administrators have a number of knobs that they can tweak, all found under /sys/kernel/mm/transparent_hugepage. The enabled value can be set to "always" (to always use THP), "madvise" (to use huge pages only in VMAs marked with MADV_HUGEPAGE), or "never" (to disable the feature). Another knob, defrag, takes the same values; it controls whether the kernel should make aggressive use of memory compaction to make more huge pages available. There's also a whole set of parameters controlling the operation of the khugepaged thread; see Documentation/vm/transhuge.txt for all the details.
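As a quick sketch of those knobs in use (the paths are as described above; writing them requires root, and the currently active value is shown in brackets when the file is read):

```shell
# Inspect the current THP mode, e.g. "[always] madvise never".
cat /sys/kernel/mm/transparent_hugepage/enabled

# Restrict huge pages to regions marked with MADV_HUGEPAGE.
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# The defrag knob takes the same set of values.
echo always > /sys/kernel/mm/transparent_hugepage/defrag
```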

The THP patch has had a bit of a rough ride since being merged into the mainline. This code never appeared in linux-next, so it surprised some architecture maintainers when it caused build failures in the mainline. Some bugs have also been found - unsurprising for a patch which is this large and which affects so much core code. Those problems are being ironed out, so, while 2.6.38-rc1 testers might want to be careful, THP should be in a usable state by the time the final 2.6.38 kernel is released.


Transparent huge pages in 2.6.38

Posted Jan 20, 2011 6:40 UTC (Thu) by jreiser (subscriber, #11027) [Link]

Huge pages can increase data cache performance by making aliasing (the mapping of address to cache set) more uniform within the huge page, in contrast to the mappings for many equivalent collections of small pages. The difference can be several percent or more.

Transparent huge pages in 2.6.38

Posted Jan 20, 2011 12:27 UTC (Thu) by rfrancoise (subscriber, #15508) [Link]

Andrea gave a talk on THP at the KVM Forum 2010 with some interesting benchmark results: slides, video.

Transparent huge pages in 2.6.38

Posted Jan 20, 2011 15:36 UTC (Thu) by Tuna-Fish (guest, #61751) [Link]

Transparent hugepage support is very interesting at the moment -- especially because both main x86 vendors are beefing up the support for them in their processors. Intel just added real support for 1GiB pages, but AMD takes the jackpot with the DTLB in the upcoming Bulldozer -- 72 L1 entries and 1024 L2 entries, holding any combination of 4kiB, 2MiB or 1GiB pages.

Hugetlbfs pages are dynamically allocate-able

Posted Jan 20, 2011 17:12 UTC (Thu) by emunson (subscriber, #44357) [Link] (3 responses)

Your description of using huge pages via hugetlbfs is not quite correct. Most modern kernels and architectures support dynamically allocating huge pages after boot.

Hugetlbfs pages are dynamically allocate-able

Posted Jan 21, 2011 6:38 UTC (Fri) by Tuna-Fish (guest, #61751) [Link] (2 responses)

Only if there is contiguous real memory available. Under real-world situations, there rarely is.

Just try allocating space on hugetlbfs after running an active web server for a few hours.

Hugetlbfs pages are dynamically allocate-able

Posted Jan 21, 2011 9:31 UTC (Fri) by jthill (subscriber, #56558) [Link]

I think the memory compaction patch is intended to fix that:

Mel ran some simple tests showing that, with compaction enabled, he was able to allocate over 90% of the system's memory as huge pages while simultaneously decreasing the amount of reclaim activity needed.

Hugetlbfs pages are dynamically allocate-able

Posted Jan 21, 2011 15:35 UTC (Fri) by emunson (subscriber, #44357) [Link]

The presence of contiguous memory will be entirely dependent on the system and workload. You are correct that allocating huge pages becomes more difficult as memory is fragmented. My reply was to the section of the article that said hugetlbfs-based huge pages must be set aside at boot time, which is not correct for all page sizes. On systems that support them, 1GB and 16GB pages must be reserved at boot, but 2MB, 4MB, and 16MB pages can be allocated any time there is contiguous space.

Transparent huge pages in 2.6.38

Posted Mar 20, 2011 20:29 UTC (Sun) by pfefferz (guest, #57490) [Link]

Hugepages are a big deal for Mobile SoCs. Designers preallocate large chunks of physical memory to ensure that their encode/decode blocks operate on contiguous memory. This large memory gets locked out of the system forever. This leads to increased system costs because manufacturers need to put down more memory than they'd like to.

Some SoC manufacturers have started using IOMMUs to map memory, but they're running up against TLB depth, which they solve by using hugepages instead of regular pages. This support should allow these IOMMUs to map memory at runtime with hugepages and theoretically allow manufacturers to use less memory. Of course that won't happen since hugepages are very scarce.

I wrote an IOMMU prototype that used its own allocator and presented it at OLS in 2010: The Virtual Contiguous Memory Manager. I think the Samsung guys put together something based on its ideas.


Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds