Kernel development
Brief items
Kernel release status
The current development kernel is 3.15-rc8, which was released on June 1. At that time, Linus Torvalds also opened the merge window for 3.16; he is trying to avoid having the merge window open during an upcoming family vacation. "So let's try to see how well that works - the last weeks of the release tends to be me just waiting around to make sure nothing bad is happening, so doing this kind of overlapping development *should* work fine. Maybe it works so well that we'll end up doing it in the future even if there *isn't* some kind of scheduling conflict that makes me want to start the merge window before I'm 100% comfortable doing the release for the previous version."

See our merge window article for more information about what has been merged so far.
Stable updates: The 3.14.5 and 3.10.41 stable kernels were released on May 31. There are no stable updates in the review process as of this writing.
Kernel development news
3.16 merge window, part 1
The merge window for the 3.16 kernel might be showing us a glimpse of a future where kernel releases happen even more frequently than they do today. By opening the window for 3.16 before the final release of the 3.15 kernel, Linus Torvalds may have shaved a week off the time between the two kernels. The length of kernel development cycles has generally trended downward, but has leveled off between 60 and 70 days for recent releases. While Torvalds's reason for overlapping the development cycles of two kernels—a family vacation—may not recur anytime soon, he may find that some parallelism in kernel development suits his purposes moving forward.
So, unlike previous merge windows, Torvalds is juggling two branches for a week—or possibly longer if serious problems pop up in -rc8. There is the mainline (or "master") branch on his tree that is accumulating the—hopefully small—fixes that are going into 3.15. In addition, he is managing a "next" branch that is collecting all of the changes bound for 3.16 (i.e. the merge window changes). Once 3.15 is released, he will presumably merge next to master and keep on merging from there.
As he said in the -rc8 release announcement, this part of the development cycle is typically fairly boring for Torvalds and the rest of the kernel hackers. Normally, Torvalds is "just waiting around to make sure nothing bad is happening" for the last few weeks of each cycle. If this "experiment" works well, one- or even two-week overlaps between kernel cycles could become a regular occurrence. That could increase the already frenetic pace of kernel development substantially.
As of this writing, Torvalds has pulled 5348 non-merge changes for 3.16 (and 54 into the mainline after the v3.15-rc8 tag). Since we are in uncharted territory, it is a little hard to say for sure when the merge window will close, but one could guess that it will close before he leaves on vacation, so an -rc1 on or about June 15 seems about right.
Changes visible to users include:
- Xen on ARM systems now supports suspend and resume.
- SMP support has been added for Marvell Armada 375 and 38x SoCs. SMP has been reworked for the Allwinner A31 SoC.
- The Goldfish virtual platform now has 64-bit support.
- Early debug serial consoles have been made generic, and support for early consoles on the pl011 serial port has been added.
- KVM on s390 gained some optimizations, support for migration, and GDB support.
- KVM has added initial little-endian support for POWER8. The project has also done MIPS user-space interface and virtualized timer work along with adding support for nested fully-virtualized Xen guests on x86 hosts.
- ACPI video will now default to using native backlight drivers, rather than the ACPI backlight interface, "which should generally help systems with broken Win8 BIOSes", Rafael Wysocki said in the pull request.
- New hardware support includes:
- Systems and processors: Support for several ARM system-on-chips (SoCs) has been added via device tree bindings, including ST Microelectronics STiH407; Freescale i.MX6SX; Samsung EXYNOS 3250, 5260, 5410, 5420, and 5800; and LSI Axxia AXM55xx.
- Audio: Behringer BCD2000 DJ controllers; NVIDIA Tegra HD Audio controllers; FireWire devices based on the Echo Digital Audio Fireworks board; FireWire devices based on BridgeCo DM1000/DM1100/DM1500 with BeBoB firmware; SoC Audio for Freescale i.MX CPUs; TI STA350 speaker amplifiers; Realtek ALC5651 codecs; Analog Devices ADAU1361 and ADAU1761 codecs; Analog Devices ADAU1381 and ADAU1781 codecs; Cirrus Logic CS42L56 low-power stereo codecs; Intel Baytrail with MAX98090 codecs; Realtek ALC5677 codecs; Google Snow boards.
- Sensors: AS3935 Franklin lightning sensors; Asahi Kasei AK8963 magnetometers; Invensense MPU6500 gyroscope/accelerometers; Freescale MPL115A2 pressure sensors; Melexis MLX90614 contact-less infrared sensors; Freescale MMA8452Q accelerometers; Nuvoton NCT6683D hardware-monitoring chips.
- Miscellaneous: SSI (Synchronous Serial Interface, aka McSAAB) protocol support; OMAP3 SSI; Nokia N900 modems; Renesas R-Car PCIe controllers; Maxim MAX77836 Micro-USB interface controllers (MUIC); Analog Devices AD799x analog-to-digital converters (ADC) graduated from staging; Microchip Technology MCP3426, MCP3427, and MCP3428 ADCs; HID device rotation; MEN 16z135 High Speed UARTs; SC16IS7xx serial ports; Exynos 5 USB dual-role device (DRD) PHYs; Maxim MAX3421 HCDs (USB-over-SPI); Marvell Armada 375/38x ARM SOC xHCI host controllers; Qualcomm APQ8064 top-level multiplexing (TLMM) blocks; Qualcomm IPQ8064 TLMM blocks; Cadence SPI controllers; X-POWERS AXP20X PMIC regulators; LTC3589, LTC3589-1, and LTC3589-2 regulators; CPU idle has been added for Cirrus Logic CLPS711X SOCs; Synaptics RMI4 touchpads; HDMI support for OMAP5.
Changes visible to kernel developers include:
- The m68k architecture now has early_printk() support for more platforms.
- Lots of cleanup and refactoring has been done in the GPIO subsystem.
- Much work has gone into the multiqueue block layer; "3.16 will be a feature complete and performant blk-mq", Jens Axboe said in his pull request. Multiqueue SCSI will be coming in 3.17. The Micron PCIe flash driver (mtip32xx) has been converted to multiqueue and those changes were merged as well.
- Several block layer files have moved from the fs/ and mm/ directories to the block/ directory: bio.c, bio-integrity.c, bounce.c, and ioprio.c.
- Samsung Exynos ARM SoCs now support multi-cluster power management, which allows big.LITTLE CPU switching. There is also support for multi-platform kernels incorporating Exynos, though there is still some driver work to do.
- CONFIG_USB_DEBUG has been removed and all USB drivers have been converted to use the dynamic debug interface.
- The family of memory-barrier functions surrounding atomic and bit operations (smp_mb__before_atomic_dec(), smp_mb__after_atomic_inc(), smp_mb__before_clear_bit(), and so on) has been reduced to just two: smp_mb__before_atomic() and smp_mb__after_atomic(); see the brief usage sketch after this list.
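As an illustration, here is a minimal sketch of how the consolidated barriers are used; the surrounding device structure and flag names are hypothetical, but the placement follows the documented pattern of ordering accesses around atomic operations that do not themselves imply ordering:

    /* hypothetical producer path in a driver */
    dev->data = value;                 /* store that must be visible first */
    smp_mb__before_atomic();           /* order the store before the bit operation */
    set_bit(DATA_READY, &dev->flags);

    /* hypothetical completion path */
    clear_bit(BUSY, &dev->flags);
    smp_mb__after_atomic();            /* order later accesses after the clear */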
Next week's edition will pick up any merges made after this report. If there are any significant merges after that, we'll write those up for the following week as well.
Locking and pinning
The kernel has long supported the concept of locking pages into physical memory; the mlock() system call is one way to accomplish that. But it turns out that there is more than one way to fix memory in place, and some of those ways have to behave differently than others. The result is confusion with resource accounting and suboptimal memory-management behavior in current kernels. A patch set from Peter Zijlstra may soon straighten things out by formalizing a second type of page locking under the name "pinning."

One of the problems with memory locking is that it doesn't quite meet the needs of all users. A page that has been locked into memory with a call like mlock() is required to always be physically present in the system's RAM. At a superficial level, locked pages should thus never cause a page fault when accessed by an application. But there is nothing that requires a locked page to always be present in the same place; the kernel is free to move a locked page if the need arises. Migrating a page will cause a soft page fault (one that is resolved without any I/O) the next time an application tries to access that page. Most of the time, that is not a problem, but developers of hard real-time applications go far out of their way to avoid even the small amount of latency caused by a soft fault. These developers would like a firmer form of locking that is guaranteed to never cause page faults. The kernel does not currently provide that level of memory locking.
Locking also fails to meet the needs of various in-kernel users. In particular, kernel code that uses a range of memory as a DMA buffer needs to know that said memory will not be moved. As a result, the locking mechanism has never been used for these pages; instead, they are fixed in place by incrementing their reference counts or through a call to get_user_pages(). Such pages are effectively fixed in place, though there is no way for the kernel to know that they may be nailed down for a long time.
There is an interesting question that arises with these informally locked pages, though: how do they interact with the resource limit mechanism? The kernel allows an administrator to place an upper bound on the number of pages that a user is able to lock into memory. But, in some cases, the creation of a DMA buffer shared with user space is the result of an application's request. So users can, for all practical purposes, lock pages in memory via actions like the creation of remote DMA (RDMA) buffers; those pages are not currently counted against the limit on locked pages. This irritates administrators and developers who want the limit on locked pages to apply to all locked pages, not just some of them.
These "back door" locked pages also create another sort of problem. Normally, the memory management subsystem goes out of its way to separate pages that can be moved from those that are fixed in place. But, in this case, the pages are often allocated as normal anonymous memory — movable pages, in other words. Fixing them in place makes them unmovable. At that point, they will be in the way any time the memory management code tries to create contiguous ranges of memory by shifting pages around; they are in a place reserved for movable pages, but, being unmovable, they cannot be moved out of the way to make the creation of larger blocks possible.
Peter's patch set tries to address all of these problems — or, at least, to show how they could be addressed. It creates a formal notion of a "pinned" page, being a page that must remain in memory at its current physical location. Pinned pages are kept in a separate virtual memory area (VMA), which is marked with the VM_PINNED flag. Within the kernel, pages can be pinned with the new mm_mpin() function:
int mm_mpin(unsigned long start, size_t len);
This function will pin the pages in memory, but only if the calling process's resource limits allow it. Kernel code that needs to access the pinned memory directly will still need to call get_user_pages(), of course; that call should be done after the call to mm_mpin().
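As a rough sketch of how kernel code might use this interface, consider the hypothetical helper below; the function and its error handling are invented for illustration, the get_user_pages() call uses the prototype found in kernels of this era, and cleanup of partial failures is omitted:

    /* hypothetical driver helper: pin a user buffer before using it for DMA */
    static int pin_dma_buffer(unsigned long start, size_t len,
                              struct page **pages, unsigned long nr_pages)
    {
        long got;
        int ret;

        ret = mm_mpin(start, len);    /* charged against the locked-pages limit */
        if (ret)
            return ret;

        down_read(&current->mm->mmap_sem);
        got = get_user_pages(current, current->mm, start, nr_pages,
                             1 /* write */, 0 /* force */, pages, NULL);
        up_read(&current->mm->mmap_sem);

        return got == nr_pages ? 0 : -EFAULT;   /* partial-pin unwinding omitted */
    }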
One of the longer-term goals (not part of this patch set) is to make this memory-pinning functionality available to user space. A new mpin() system call would function like mlock(), but with the additional guarantee that the page would never be moved and, thus, would never generate page faults on access. Adding this functionality would mostly appear to be a matter of setting up the system call plumbing.
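Should that happen, the user-space side would presumably look just like mlock(); the fragment below is entirely hypothetical, since no mpin() system call exists yet, and the buffer names are placeholders:

    /* hypothetical: keep a real-time buffer resident and immovable */
    void *buf = aligned_alloc(4096, buf_size);

    if (mpin(buf, buf_size) != 0)      /* same shape as mlock(addr, len) */
        perror("mpin");
    /* accesses to buf would then never take even a soft page fault */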
Another currently unimplemented feature is the migration of the pages to be pinned prior to nailing them down. The mm_mpin() call makes it clear that the pages involved will not be movable in the near future. It would thus make sense for the kernel to shift them out of a movable zone (if that is where they are currently located) and into one of the ranges of memory reserved for non-movable pages. That would prevent pinned pages from interfering with memory compaction and, thus, would facilitate the creation of larger blocks of free memory in those pages' original location.
Finally, putting pinned pages under their own VMA makes it relatively easy to keep track of them. So pinned pages can be counted against the locked-pages limit, eliminating that particular loophole.
Thus far, nobody seems to be overly bothered by this patch set. In previous discussions, there have been concerns that changing the accounting of locked pages could cause regressions on some systems where users are running close to their limits. There are few ways around that problem, though; one could continue to leave pinned pages out of the equation or, perhaps, create a separate limit for them. Neither option has a great deal of appeal, so it may just be that this change will go through as-is.
Another attempt at power-aware scheduling
Numerous attempts to improve the power efficiency of the kernel's CPU scheduler have been made in recent years. Most of these attempts have taken the form of adding heuristics to the scheduler ("group small tasks onto just a few CPUs," for example) that, it was hoped, would lead to more efficient use of the system's resources. These attempts have run aground for a number of reasons, including the fact that they tend to work for only a subset of the range of workloads and systems out there and their lack of integration with other CPU-related power management subsystems, including the CPU frequency and CPU idle governors. At the power-aware scheduling mini-summit in 2013, a call was made for a more organized approach to the problem. Half a year or so later, some of the results are starting to appear.

In particular, Morten Rasmussen's Energy cost model for energy-aware scheduling patch set was posted on May 23. This patch set is more of a demonstration-of-concept than something suitable for merging, but it does show the kind of thinking that is going into power-aware scheduling now. Heuristics have been replaced with an attempt to measure and calculate what the power cost of each scheduling decision will be.
The patch set starts by creating a new data structure to describe the available computing capacity of each CPU and the power cost of running at each capacity. If a given CPU can operate at three independent frequencies, this data structure will contain a three-element array describing the power cost of running at each frequency and the associated computing capacity that will be available. There are no specific units associated with either number; as long as they are consistent across the system, things will work.
On a simple system, the cost/capacity array will be the same for each CPU. But things can quickly get more complicated than that. Asymmetric systems (big.LITTLE systems, for example) have both low-power and high-power CPUs offering a wide range of capacities. On larger systems, CPUs are organized into packages and NUMA nodes; the power cost of running two CPUs on the same package will be quite a bit less than the cost of keeping two packages powered up. So the cost/capacity array must be maintained at every level of the scheduling domain hierarchy (which matches the hardware topology), and scheduling decisions must take into account the associated cost at every level.
In the current patch set, this data structure is hard coded for a specific ARM processor. One of the many items on the "to do" list is to create this data structure automatically, either from data found in the firmware or from a device tree. Either way, some architecture-specific code will have to be written, but that was not a problem that needed to be solved to test out the concepts behind this patch set.
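To make the idea concrete, a structure along these lines could describe one CPU (or one level of the domain hierarchy); the names here are illustrative, not necessarily those used in Morten's patches:

    /* illustrative only -- structure and field names are made up */
    struct capacity_state {
        unsigned long capacity;    /* compute capacity at this operating point */
        unsigned long power;       /* power cost of running at that capacity   */
    };

    struct energy_profile {
        int nr_states;                    /* e.g. 3 for three frequencies      */
        struct capacity_state *states;    /* one entry per operating point     */
        unsigned long idle_power;         /* cost of keeping the unit powered  */
    };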
With this data structure in place, it is possible to create a simple function:
int energy_diff_util(int cpu, int utilization);
The idea is straightforward enough: return the difference in power consumption that will result from adding a specific load (represented by utilization) to a given CPU. In the real world, though, there are a few difficulties to be dealt with. One of those is that the kernel does not really know how much CPU utilization comes with a specific task. So the patch set has to work with the measured load values, which are not quite the same thing; in particular, load does not take a process's priority into account.
Then there is the little problem that the scheduler does not actually know anything about what the CPU frequency governor is doing with any given CPU. The patch set adds a hack to make the current frequency of each CPU available, and there is an explicit assumption that the governor will make changes to match utilization changes on any given processor. The lack of integration between these subsystems was a major complaint at last year's mini-summit; it is clearly a problem that will need to be addressed as part of any viable power-aware scheduling patch. But, for the time being, it's another detail that can be glossed over while the main concepts are worked out.
There are a number of factors beyond pure CPU usage that can change how much power a given process needs. One of those is CPU wakeups: taking a processor out of a sleep state has an energy cost of its own. It is not possible to know how often a given process will awaken a sleeping CPU, but one can get an approximate measure by tracking how often the process itself wakes up from a sleeping state. If one assumes that some percentage of those wakeups will happen when the CPU itself was sleeping, one can make a guess at how many CPU wakeups will be added if a process is made to run on a given CPU.
So Morten's patch set adds simple process wakeup tracking to get a sense for whether a given process wakes up frequently or rarely. Then, when the time comes to consider running that process on a given CPU, a look at that CPU's current idle time will generate a guess for how many additional wakeups the process would create there. A CPU that is already busy most of the time will not sleep often, so it will suffer fewer wakeups than one that is mostly idle. Factor in the energy cost of waking the CPU (which will depend on just how deeply it is sleeping, another quantity that is hard for the scheduler to get currently) and an approximate energy cost associated with wakeups can be calculated.
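Put together, the estimate for a candidate CPU looks roughly like the following sketch. Every helper name here is invented for illustration, utilization is treated as a simple percentage, and the real patches work with the scheduler's load-tracking numbers rather than true utilization:

    /* illustrative sketch of the calculation described above */
    static int estimated_energy_diff(int cpu, int util, int task_wakeups)
    {
        int old_util = cpu_current_utilization(cpu);      /* hypothetical helper */
        int new_util = old_util + util;

        /* running cost at the (possibly higher) capacity needed for new_util */
        int run_cost = energy_at_utilization(cpu, new_util)
                     - energy_at_utilization(cpu, old_util);

        /* the busier the CPU already is, the fewer wakeups it will suffer */
        int idle_fraction = 100 - old_util;
        int wake_cost = task_wakeups * idle_fraction
                      * cpu_wakeup_energy(cpu) / 100;

        return run_cost + wake_cost;
    }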
With that structure in place, it's just a matter of performing the energy calculations for each possible destination when the time comes to pick a CPU for a given task. Iterating through all CPUs could get expensive, so the code tries to quickly narrow things down to one low-level group of CPUs; the lowest-cost CPU in that group is then chosen. In this patch set, find_idlest_cpu() is modified to do this search; other places where task placement decisions are made (load balancing, for example) have not been modified.
The patch set came with a small amount of benchmark information; it shows energy savings from 3% to 50%, depending on the workload, on a big.LITTLE system. As Morten notes, the savings on a fully symmetric system will be smaller. There is also an approximate quadrupling of the time taken to switch tasks; that cost is likely to be seen as unacceptable, but it should also be possible to reduce that cost considerably with some targeted optimization work.
Thus far, discussion of the patch set has been muted. Getting sufficient reviewer attention on power-aware scheduling patches has been a problem in the past. The tighter focus of this patch set should help to make review relatively easy, though, so, with luck, this work will be looked over in the near future. Then we'll have an idea of whether it represents a viable path forward or not.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jake Edge