ZONE_DEVICE and the future of struct page
ZONE_DEVICE is tied into the memory allocator's zone system (which segregates memory based on attributes like NUMA node or DMA reachability), but in a special way. It was created to satisfy the need to perform DMA operations on persistent memory; these operations require page structures to set up the mappings. ZONE_DEVICE is, Williams said, essentially the top half of the memory hotplug mechanism; it performs the memory setup, but does not actually put the pages online for general use. So memory located in ZONE_DEVICE cannot be allocated in the normal ways, pages cannot be migrated into that space, etc. But it is possible to get a page structure for memory in that zone.
Over the past few years, as the development community looked at the implications of large persistent-memory arrays, developers were concerned about the cost of using page structures — 64 bytes for every 4KB memory page. That usage seemed wasteful, so some significant effort went into trying to avoid using page structures altogether; instead, it was thought that the management of persistent memory could be done entirely with page-frame numbers (PFNs). The pfn_t type, along with a bunch of supporting structure, was added toward that goal, and developers tried to convert the entire DMA API to use PFNs. But then they ran into the SPARC64 architecture, which cannot create DMA mappings without using page structures. The pfn_t effort, Williams said, died there.
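To put the 64-bytes-per-4KB figure in perspective, the overhead can be worked out directly. What follows is a minimal sketch in userspace C; the function name page_struct_overhead() and the constant are illustrative and not taken from the kernel, and the 64-byte size is simply the figure quoted above rather than a guarantee for every configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed size of one page structure, as quoted in the article. */
#define STRUCT_PAGE_BYTES 64ULL

/*
 * Bytes of page structures needed to describe mem_bytes of memory
 * when each structure covers page_bytes of it.
 */
static uint64_t page_struct_overhead(uint64_t mem_bytes, uint64_t page_bytes)
{
    /* The sketch only handles memory that is a whole number of pages. */
    assert(page_bytes != 0 && mem_bytes % page_bytes == 0);
    return (mem_bytes / page_bytes) * STRUCT_PAGE_BYTES;
}
```

Under those assumptions, 1TB of persistent memory described in 4KB pages needs 16GB of page structures, or roughly 1.6% of the capacity.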
Now, he said, perhaps the time has come to stop trying to avoid struct page. If, instead, we let drivers assume that page structures will be available, we'll pay the memory-use cost in systems with terabytes of persistent memory, but we'll avoid dealing with a lot of custom driver code with inconsistent behavior. That would solve the DMA problem, but that's probably the easiest of the problems in this area; struct page tends to pop up in a lot of places.
Matthew Wilcox observed that, in truth, few drivers really care about struct page itself; it really just serves as a convenient handle for referring to physical memory. He suggested that it might make sense to go back and take a hard look at why SPARC is stuck with using page structures; Williams said it had to do with the management of cache aliasing state. James Bottomley suggested that there may be other ways to solve this problem, such as using a separate array to hold aliasing information. It would just be a matter of persuading SPARC maintainer Dave Miller.
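Bottomley's suggestion of a separate array could look something like the sketch below: a bitmap, indexed by PFN, that records per-page cache state outside of struct page. Everything here (the alias_map name, its functions, and the choice of a single bit per page) is hypothetical; the state that SPARC64 actually tracks may well need more than one bit, and this is userspace C rather than kernel code.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical side table: one state bit per PFN in [base_pfn, base_pfn + nr_pfns). */
struct alias_map {
    uint64_t base_pfn;
    uint64_t nr_pfns;
    unsigned long *bits;
};

/* Allocate a zeroed bitmap; returns 0 on success, -1 if allocation fails. */
static int alias_map_init(struct alias_map *m, uint64_t base_pfn, uint64_t nr_pfns)
{
    size_t words;

    assert(m != NULL && nr_pfns > 0);
    words = (size_t)((nr_pfns + BITS_PER_WORD - 1) / BITS_PER_WORD);
    m->bits = calloc(words, sizeof(*m->bits));
    if (!m->bits)
        return -1;
    m->base_pfn = base_pfn;
    m->nr_pfns = nr_pfns;
    return 0;
}

/* Set or clear the state bit for one PFN. */
static void alias_map_set(struct alias_map *m, uint64_t pfn, int dirty)
{
    uint64_t idx;

    assert(pfn >= m->base_pfn && pfn - m->base_pfn < m->nr_pfns);
    idx = pfn - m->base_pfn;
    if (dirty)
        m->bits[idx / BITS_PER_WORD] |= 1UL << (idx % BITS_PER_WORD);
    else
        m->bits[idx / BITS_PER_WORD] &= ~(1UL << (idx % BITS_PER_WORD));
}

/* Return 1 if the state bit for this PFN is set, 0 otherwise. */
static int alias_map_test(const struct alias_map *m, uint64_t pfn)
{
    uint64_t idx;

    assert(pfn >= m->base_pfn && pfn - m->base_pfn < m->nr_pfns);
    idx = pfn - m->base_pfn;
    return (int)((m->bits[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1UL);
}

static void alias_map_destroy(struct alias_map *m)
{
    free(m->bits);
    m->bits = NULL;
}
```

At one bit per 4KB page, a table like this would cost about 32MB per terabyte of memory covered, which is the appeal of the approach when compared with a full page structure per page.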
If that persuasion could be accomplished, then pfn_t could be used nearly everywhere and there would be less need to worry about the availability of page structures. A remaining problem might be drivers that need to reach directly into DMA buffers but, Wilcox said, they should just use ioremap() to get a usable address to work with.
One of the big motivations for avoiding struct page with persistent-memory arrays is that these structures can end up filling a large portion of the system's ordinary memory. The way to avoid that, of course, is to allocate the structures in the persistent memory itself; Wilcox said that, whenever new memory is added to the system, its associated page structures should always be located in that new memory. The problem is that struct page can be a heavily used structure, so there is value in having the ability to control its placement.
One possible solution to the memory-use problem is to allow page structures to refer to larger pages — 2MB huge pages, for example. The problem here is that making the size variable would add overhead to some of the hottest code paths in the kernel. There would be CPU-time savings in some areas, since the number of page structures to be managed would be reduced considerably, but there are doubts that the savings would make up for the higher costs in places like the page allocator.
Another option, Williams said, is to allocate page structures dynamically when they are needed. A persistent-memory array can be terabytes in size, but page structures may only be needed for a small portion of it. If allocation of page structures can be made cheap, it would make sense to only bring them into existence when the need arises.
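The idea of creating page structures on demand can be illustrated with a small userspace sketch. The names below (fake_page, lazy_page_table, and the functions that operate on them) are hypothetical stand-ins invented for illustration; nothing here reflects an actual kernel interface, and the descriptor is far simpler than the real struct page.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page. */
struct fake_page {
    uint64_t pfn;
    unsigned int refcount;
};

/* Hypothetical table covering PFNs [base, base + count), populated lazily. */
struct lazy_page_table {
    uint64_t base;
    uint64_t count;
    struct fake_page **slots;   /* NULL until a descriptor is first needed */
    uint64_t allocated;         /* descriptors actually created so far */
};

/* Set up an empty table; returns 0 on success, -1 if allocation fails. */
static int lazy_table_init(struct lazy_page_table *t, uint64_t base, uint64_t count)
{
    assert(t != NULL && count > 0);
    t->slots = calloc((size_t)count, sizeof(*t->slots));
    if (!t->slots)
        return -1;
    t->base = base;
    t->count = count;
    t->allocated = 0;
    return 0;
}

/*
 * Return the descriptor for pfn, creating it on first use.
 * Returns NULL if pfn is outside the table or memory runs out.
 */
static struct fake_page *lazy_get_page(struct lazy_page_table *t, uint64_t pfn)
{
    struct fake_page **slot;

    if (pfn < t->base || pfn - t->base >= t->count)
        return NULL;
    slot = &t->slots[pfn - t->base];
    if (!*slot) {
        *slot = calloc(1, sizeof(**slot));
        if (!*slot)
            return NULL;
        (*slot)->pfn = pfn;
        t->allocated++;
    }
    return *slot;
}

/* Free every descriptor that was created, then the slot array itself. */
static void lazy_table_destroy(struct lazy_page_table *t)
{
    uint64_t i;

    for (i = 0; i < t->count; i++)
        free(t->slots[i]);
    free(t->slots);
    t->slots = NULL;
    t->allocated = 0;
}
```

Note that the flat pointer array in this sketch still costs one pointer per page, so it only reduces the overhead rather than eliminating it; a real design would presumably want a sparser index and would have to make the first-use allocation cheap enough for the paths that trigger it.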
The conversation wound down in a wandering manner. Bottomley suggested using radix trees to track ranges of memory instead. Kirill Shutemov pointed out that different kinds of information are needed for different page sizes; in the case of transparent huge pages, it may be necessary to refer to a 4KB page as both a single page and a component of a huge page. Rik van Riel said that page structures are only really needed for dynamic RAM; they can be dispensed with for persistent memory, since filesystems can be counted on to free memory when it's no longer in use. Bottomley replied that this approach is possible, but nobody has been willing to implement it so far, leading Williams to observe that the group would be talking about the same problem again next year.
Index entries for this article | |
---|---|
Kernel | Memory management/Nonvolatile memory |
Conference | Storage, Filesystem, and Memory-Management Summit/2017 |
Posted Mar 21, 2017 15:34 UTC (Tue)
by willy (subscriber, #9762)
[Link]
There was a certain amount of cross-talk and mis-speaking; drivers that need to reach into the scatterlist to manipulate the data need a kernel address. What we said yesterday was "They should be using kmap_pfn()", which is actually a gross oversimplification. For the benefit of our audience, on a 32-bit machine, the physical address may not be in lowmem, so you can't just do pfn_to_virt() or page_to_virt().
What I now believe is that we need a kmap_sg() and then drivers don't need to care whether there's a PFN or a struct page in the scatterlist; they're getting the virtual address that they need. I'm not sure whether we want a kmap_sg_atomic(). A quick grep tells me we already have scsi_kmap_atomic_sg() which looks ideal other than the "scsi_" prefix. We also have bvec_kmap_irq(), bio_kmap_irq() and __bio_kmap_atomic().
Posted Mar 21, 2017 19:28 UTC (Tue)
by roc (subscriber, #30627)
[Link] (6 responses)
Posted Mar 21, 2017 21:49 UTC (Tue)
by willy (subscriber, #9762)
[Link]
We wouldn't let, say, FR-V or Alpha disrupt persistent memory features, but I think SPARC64 is still relevant.
Posted Mar 22, 2017 6:33 UTC (Wed)
by flussence (guest, #85566)
[Link] (4 responses)
Posted Mar 22, 2017 22:04 UTC (Wed)
by roc (subscriber, #30627)
[Link] (3 responses)
Suppose for the sake of argument that the only two options are #1 drop SPARC64 support or #2 all architectures must waste 64 bytes per 4K page. If the "don't break things" rule forces choice #2 then that means Linux performance on much more common systems is being dragged down by legacy baggage (especially after more similar decisions accumulate).
Now suppose there's another option, #3 waste no memory and do a bunch of rearchitecting of the SPARC64 port to handle it. That sounds good, but what if the SPARC64 maintainer doesn't want to do the work? If you're 100% committed to "don't break things" then you can't motivate them by threatening to drop SPARC64 support. Instead the burden of reworking SPARC64 falls on whoever's implementing the core feature. That really sucks for various reasons, but in particular you're making core development that benefits many users much more difficult for the sake of some relatively very small number of users.
"Lazy" is a pejorative term implying moral deficiency. There's nothing morally deficient about being honest about taking these tradeoffs seriously.
Posted Mar 23, 2017 22:25 UTC (Thu)
by djbw (subscriber, #78104)
[Link] (2 responses)
Once we mandate struct page for DAX this appears to open up several cleanup opportunities like re-using more of the core page cache implementation and unifying device-DAX / filesystem-DAX.
Posted Mar 23, 2017 23:39 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Mar 24, 2017 0:15 UTC (Fri)
by djbw (subscriber, #78104)
[Link]
ZONE_DEVICE struct page != ZONE_NORMAL struct page in terms of write rate