
ZONE_DEVICE and the future of struct page

By Jonathan Corbet
March 21, 2017
LSFMM 2017
The opening session of the 2017 Linux Storage, Filesystem, and Memory-Management Summit covered a familiar topic: how to represent (possibly massive) persistent-memory arrays to various subsystems in the kernel. This session, led by Dan Williams, focused in particular on the ZONE_DEVICE abstraction and whether the kernel should use page structures to represent persistent memory or not.

ZONE_DEVICE is tied into the memory allocator's zone system (which segregates memory based on attributes like NUMA node or DMA reachability), but in a special way. It was created to satisfy the need to perform DMA operations on persistent memory; these operations require page structures to set up the mappings. ZONE_DEVICE is, he said, essentially the top half of the memory hotplug mechanism; it performs the memory setup, but does not actually put the pages online for general use. So memory located in ZONE_DEVICE cannot be allocated in the normal ways, pages cannot be migrated into that space, etc. But it is possible to get a page structure for memory in that zone.

Over the past few years, as the development community looked at the implications of large persistent-memory arrays, developers were concerned about the cost of using page structures: 64 bytes for every 4KB memory page. That usage seemed wasteful, so some significant effort went into trying to avoid using page structures altogether; instead, it was thought that the management of persistent memory could be done entirely with page-frame numbers (PFNs). The pfn_t type, along with a bunch of supporting structure, was added toward that goal, and developers tried to convert the entire DMA API to use PFNs. But then they ran into the SPARC64 architecture, which cannot create DMA mappings without using page structures. The pfn_t effort, Williams said, died there.
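
To put that cost in perspective, the arithmetic can be sketched in a few lines of Python. This is a userspace illustration only, not kernel code: the 64-byte and 4KB figures come from the discussion above, while the function name and the 2 TiB example size are assumptions chosen for illustration.

```python
PAGE_SIZE = 4096        # 4KB base page, as quoted in the article
STRUCT_PAGE_SIZE = 64   # bytes per page structure, as quoted in the article

TIB = 1 << 40
GIB = 1 << 30


def struct_page_overhead(memory_bytes, page_size=PAGE_SIZE,
                         struct_size=STRUCT_PAGE_SIZE):
    """Return the bytes of page structures needed to describe memory_bytes.

    Hypothetical helper: a partially filled final page still needs a
    full page structure, so the page count is rounded up.
    """
    pages = -(-memory_bytes // page_size)  # ceiling division
    return pages * struct_size


# Illustrative array size (an assumption, not a figure from the session).
overhead = struct_page_overhead(2 * TIB)
print(overhead // GIB, "GiB of page structures for a 2 TiB array")
# → 32 GiB of page structures for a 2 TiB array

print(f"{STRUCT_PAGE_SIZE / PAGE_SIZE:.4%} of the memory described")
# → 1.5625% of the memory described
```

The ratio is fixed by the two constants, so the overhead scales linearly with the size of the array; what changes with persistent memory is the absolute amount of ordinary RAM that may be consumed.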

Now, he said, perhaps the time has come to stop trying to avoid struct page. If, instead, we let drivers assume that page structures will be available, we'll pay the memory-use cost in systems with terabytes of persistent memory, but we'll avoid dealing with a lot of custom driver code with inconsistent behavior. That would solve the DMA problem, but that's probably the easiest of the problems in this area; struct page tends to pop up in a lot of places.

Matthew Wilcox observed that, in truth, few drivers really care about struct page itself; it really just serves as a convenient handle for referring to physical memory. He suggested that it might make sense to go back and take a hard look at why SPARC is stuck with using page structures; Williams said it had to do with the management of cache aliasing state. James Bottomley suggested that there may be other ways to solve this problem, such as using a separate array to hold aliasing information. It would just be a matter of persuading SPARC maintainer Dave Miller.

If that persuasion could be accomplished, then pfn_t could be used nearly everywhere and there would be less need to worry about the availability of page structures. A remaining problem might be drivers that need to reach directly into DMA buffers but, Wilcox said, they should just use ioremap() to get a usable address to work with.

One of the big motivations for avoiding struct page with persistent-memory arrays is that these structures can end up filling a large portion of the system's ordinary memory. The way to avoid that, of course, is to allocate the structures in the persistent memory itself; Wilcox said that, whenever new memory is added to the system, its associated page structures should always be located in that new memory. The problem is that struct page can be a heavily used structure, so there is value in having the ability to control its placement.

One possible solution to the memory-use problem is to allow page structures to refer to larger pages — 2MB huge pages, for example. The problem here is that making the size variable would add overhead to some of the hottest code paths in the kernel. There would be CPU-time savings in some areas, since the number of page structures to be managed would be reduced considerably, but there are doubts that the savings would make up for the higher costs in places like the page allocator.
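
The scale of the reduction can be estimated with a small Python sketch. It is a rough illustration rather than a model of the kernel: the 64-byte structure size is the figure quoted earlier in the article, and the 1 TiB array size and helper name are assumptions.

```python
STRUCT_PAGE_SIZE = 64   # bytes per page structure, as quoted in the article
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30
TIB = 1 << 40


def page_struct_count(memory_bytes, page_size):
    """Hypothetical helper: number of page structures if each one
    describes page_size bytes of memory."""
    return memory_bytes // page_size


array_size = 1 * TIB  # illustrative size, not from the session

small = page_struct_count(array_size, 4 * KIB)
huge = page_struct_count(array_size, 2 * MIB)

print(small, "structures with 4KB pages,", small * STRUCT_PAGE_SIZE // GIB, "GiB")
# → 268435456 structures with 4KB pages, 16 GiB
print(huge, "structures with 2MB pages,", huge * STRUCT_PAGE_SIZE // MIB, "MiB")
# → 524288 structures with 2MB pages, 32 MiB
print("reduction factor:", small // huge)
# → reduction factor: 512
```

The sketch captures only the memory side of the tradeoff; the added cost of handling a variable page size in hot paths like the page allocator, which was the concern raised in the session, is not something a calculation like this can show.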

Another option, Williams said, is to allocate page structures dynamically when they are needed. A persistent-memory array can be terabytes in size, but page structures may only be needed for a small portion of it. If allocation of page structures can be made cheap, it would make sense to only bring them into existence when the need arises.
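
The idea can be illustrated with a toy model in Python. Nothing here reflects an actual kernel design: the LazyPageMap class, its methods, the dictionary standing in for a page structure, and the policy of freeing a descriptor when its reference count drops to zero are all assumptions made for the sake of the sketch.

```python
PAGE_SHIFT = 12  # 4KB pages


class LazyPageMap:
    """Hypothetical model: page descriptors for a device range are
    created on first use rather than for every page up front."""

    def __init__(self, base_pfn, nr_pages):
        self.base_pfn = base_pfn
        self.nr_pages = nr_pages
        self._pages = {}  # pfn -> descriptor, populated on demand

    def get_page(self, pfn):
        """Return the descriptor for pfn, allocating it if needed,
        and take a reference on it."""
        if not (self.base_pfn <= pfn < self.base_pfn + self.nr_pages):
            raise ValueError("PFN outside the device range")
        page = self._pages.get(pfn)
        if page is None:
            page = {"pfn": pfn, "refcount": 0}
            self._pages[pfn] = page
        page["refcount"] += 1
        return page

    def put_page(self, pfn):
        """Drop a reference; free the descriptor when unused
        (an assumed policy, chosen to keep the example short)."""
        page = self._pages[pfn]
        page["refcount"] -= 1
        if page["refcount"] == 0:
            del self._pages[pfn]

    def allocated(self):
        """Number of descriptors that currently exist."""
        return len(self._pages)


# A 1 TiB range contains 2**28 pages, but only touched pages cost memory.
pmem = LazyPageMap(base_pfn=1 << 20, nr_pages=(1 << 40) >> PAGE_SHIFT)
pmem.get_page((1 << 20) + 5)
pmem.get_page((1 << 20) + 5)
pmem.get_page((1 << 20) + 9)
print(pmem.allocated())   # → 2
pmem.put_page((1 << 20) + 9)
print(pmem.allocated())   # → 1
```

Whether such a scheme pays off depends on the lookup and allocation being cheap enough, which is exactly the condition Williams attached to the idea.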

The conversation wound down in a wandering manner. Bottomley suggested using radix trees to track ranges of memory instead. Kirill Shutemov pointed out that different kinds of information are needed for different page sizes; in the case of transparent huge pages, it may be necessary to refer to a 4KB page as both a single page and a component of a huge page. Rik van Riel said that page structures are only really an issue for dynamic RAM; they can be dispensed with for persistent memory, since filesystems can be counted on to free memory when it's no longer in use. Bottomley replied that this approach is possible, but nobody has been willing to implement it so far, leading Williams to observe that the group would be talking about the same problem again next year.

Index entries for this article
Kernel: Memory management/Nonvolatile memory
Conference: Storage, Filesystem, and Memory-Management Summit/2017



ZONE_DEVICE and the future of struct page

Posted Mar 21, 2017 15:34 UTC (Tue) by willy (subscriber, #9762) [Link]

Hi Jon; thanks for the write-up as always!

There was a certain amount of cross-talk and mis-speaking; drivers that need to reach into the scatterlist to manipulate the data need a kernel address. What we said yesterday was "They should be using kmap_pfn()", which is actually a gross oversimplification. For the benefit of our audience, on a 32-bit machine, the physical address may not be in lowmem, so you can't just do pfn_to_virt() or page_to_virt().

What I now believe is that we need a kmap_sg() and then drivers don't need to care whether there's a PFN or a struct page in the scatterlist; they're getting the virtual address that they need. I'm not sure whether we want a kmap_sg_atomic(). A quick grep tells me we already have scsi_kmap_atomic_sg(), which looks ideal other than the "scsi_" prefix. We also have bvec_kmap_irq(), bio_kmap_irq(), and __bio_kmap_atomic().

ZONE_DEVICE and the future of struct page

Posted Mar 21, 2017 19:28 UTC (Tue) by roc (subscriber, #30627) [Link] (6 responses)

Why would anyone let SPARC64 requirements drive a design decision like this? That architecture is marginal and dying. And in this case it sounds like it could be worked around with some effort by the SPARC64 maintainer(s).

ZONE_DEVICE and the future of struct page

Posted Mar 21, 2017 21:49 UTC (Tue) by willy (subscriber, #9762) [Link]

I don't think it's accurate to say "marginal and dying". For one thing, it is my strong suspicion that we will see persistent memory on SPARC64 CPUs given Oracle's focus. It has a public roadmap going out to 2021.

We wouldn't let, say, FR-V or Alpha disrupt persistent memory features, but I think SPARC64 is still relevant.

ZONE_DEVICE and the future of struct page

Posted Mar 22, 2017 6:33 UTC (Wed) by flussence (guest, #85566) [Link] (4 responses)

The kernel's golden rule is “don't break userspace”. Intentionally breaking an entire class of currently-working systems for the sake of being lazy is a pretty awful thing to suggest.

ZONE_DEVICE and the future of struct page

Posted Mar 22, 2017 22:04 UTC (Wed) by roc (subscriber, #30627) [Link] (3 responses)

It's not about being lazy, it's about tradeoffs.

Suppose for the sake of argument that the only two options are #1 drop SPARC64 support or #2 all architectures must waste 64 bytes per 4K page. If the "don't break things" rule forces choice #2 then that means Linux performance on much more common systems is being dragged down by legacy baggage (especially after more similar decisions accumulate).

Now suppose there's another option, #3 waste no memory and do a bunch of rearchitecting of the SPARC64 port to handle it. That sounds good, but what if the SPARC64 maintainer doesn't want to do the work? If you're 100% committed to "don't break things" then you can't motivate them by threatening to drop SPARC64 support. Instead the burden of reworking SPARC64 falls on whoever's implementing the core feature. That really sucks for various reasons, but in particular you're making core development that benefits many users much more difficult for the sake of some relatively very small number of users.

"Lazy" is a pejorative term implying moral deficiency. There's nothing morally deficient about being honest about taking these tradeoffs seriously.

ZONE_DEVICE and the future of struct page

Posted Mar 23, 2017 22:25 UTC (Thu) by djbw (subscriber, #78104) [Link] (2 responses)

In fact, it's not a waste. It's fundamental to many kernel paths. The DAX enabling without pages loses get_user_pages() support, which disables not only DMA / direct-I/O, but also fundamental operations like fork and ptrace. We're already paying this 1.5% overhead for main memory, and my argument is that we should simply pay that overhead for persistent memory as well. It's not enough to convert some paths to use pfn_t with new kmap() primitives, because that leaves us an ongoing maintenance burden of dual code paths as developers add new struct page usages. Unless we create a plan to get rid of struct page everywhere we should not special case persistent memory... especially when we have a mechanism to pay the overhead cost from pmem itself.

Once we mandate struct page for DAX this appears to open up several clean-up opportunities like re-using more of the core page cache implementation and unifying device-DAX / filesystem-DAX.

ZONE_DEVICE and the future of struct page

Posted Mar 23, 2017 23:39 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

There are problems with that. Persistent memory is fast and durable, but not as fast and durable as the regular volatile RAM. So you have to put your page structures in the main RAM and this adds up quickly: a 2TB persistent array will require around 32GB of page structures.

ZONE_DEVICE struct page != ZONE_NORMAL struct page in terms of write rate

Posted Mar 24, 2017 0:15 UTC (Fri) by djbw (subscriber, #78104) [Link]

I'm not convinced that's going to be a problem in practice. Consider that the bulk of what makes struct page a frequently accessed data structure is when it is used by the core mm for general purpose page allocations. The ZONE_DEVICE mechanism never releases these pages for that high frequency usage. Another mitigation is that struct page writes are buffered by the cpu cache, which further reduces the write rate to media.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds