5.4 Merge window, part 2

By Jonathan Corbet
September 30, 2019

The release of the 5.4-rc1 kernel and the closing of the merge window for this development cycle came one day later than would normally have been expected. By that time, 12,554 non-merge changesets had been pulled into the mainline repository; nearly 2,900 of those arrived after the first-week summary was written. That relatively small number of changes belies the amount of interesting change that arrived late in the merge window, though.

Changes merged in the second half of the merge window include:

Architecture-specific

The PowerPC architecture has gained support for an "ultravisor", which is an especially privileged layer of software charged with keeping the hypervisor in line. See this document for details.

Core kernel

There is a new operation, IORING_OP_TIMEOUT, that can be requested from the io_uring subsystem. It will cause the calling process to be woken after the specified timeout period; see this commit for details.

Filesystems and block layer

The dm-verity subsystem can now validate the root hash of a volume using a trusted key in the kernel keyring.
The new dm-clone target makes a copy of an existing read-only device. "The main use case of dm-clone is to clone a potentially remote, high-latency, read-only, archival-type block device into a writable, fast, primary-type device for fast, low-latency I/O". More information can be found in this commit.
The F2FS filesystem has gained support for case-independent file-name lookups. See this commit for some details.
The new "virtiofs" filesystem allows a host to export filesystems efficiently to guest systems. See this document and this commit message for more information.
It's not in 5.4 but worth a mention anyway: Samsung has decided to upstream its internal "sdfat" filesystem; this is a newer implementation of exFAT that, it is said, has fewer code-quality problems and more features. So the exFAT implementation added to the staging tree earlier in the merge window probably has a short life expectancy, at least in its current form.
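
As a rough illustration of what case-independent file-name lookup (as added to F2FS above) means for a directory, the following Python sketch keeps names as they were created but compares them after normalization and case folding. The class and function names are hypothetical, and the folding rule shown (NFD normalization plus Python's str.casefold()) is an assumption chosen for illustration; it is not claimed to match the kernel's Unicode handling exactly.

```python
import unicodedata
from typing import Dict, Optional


def fold(name: str) -> str:
    """Return a comparison key: decompose, case-fold, decompose again."""
    decomposed = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", decomposed.casefold())


class CaseFoldingDirectory:
    """Hypothetical in-memory directory: case-preserving, case-insensitive."""

    def __init__(self) -> None:
        # Maps the folded key to the name exactly as it was created.
        self._entries: Dict[str, str] = {}

    def create(self, name: str) -> None:
        key = fold(name)
        if key in self._entries:
            raise FileExistsError(self._entries[key])
        self._entries[key] = name

    def lookup(self, name: str) -> Optional[str]:
        """Return the stored spelling of a matching entry, or None."""
        return self._entries.get(fold(name))


if __name__ == "__main__":
    d = CaseFoldingDirectory()
    d.create("Makefile")
    print(d.lookup("MAKEFILE"))
```

The lookup returns the name as originally stored, so a directory listing still shows the spelling the user chose even though any case variant will find the file.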

Hardware support

Clock: Marvell Armada AP CPU clock controllers, MediaTek MT6779 clock controllers, Ingenic JZ47xx TCU clocks and interrupt controllers, and Amlogic Meson virtual realtime clocks.
Miscellaneous: Freescale FlexTimer alarm timers, Macronix raw NAND controllers, Creative SB0540 infrared receivers, Intel Merrifield Basin Cove power-management ICs, NXP IMX7ULP watchdog timers, and Spreadtrum pulse-width modulators.
PCI: Amazon Annapurna Labs PCIe controllers and NVIDIA Tegra194 PCIe controllers.

Memory management

It is now possible to use transparent huge pages for read-only file-mapped virtual memory areas. In practice, for now, this feature only works with executable text sections; an madvise() call is required to turn it on. See this commit for a bit of detail.
There are two new madvise() commands to force the kernel to reclaim specific pages. MADV_COLD moves the indicated pages to the inactive list, essentially marking them unused and suitable targets for page reclaim. A stronger variant is MADV_PAGEOUT, which causes the pages to be reclaimed immediately.
When we last looked at this memory-management performance-regression problem, there was pressure to revert a change reverting a performance-related patch. That revert was reverted for 5.3-rc5; now the revert of the revert has been reverted for 5.4. So the original revert is now in place, and a couple of different patches addressing the original problem have been merged. See this changelog for some more information, along with Linus Torvalds's reasoning for bypassing the memory-management developers and applying these patches directly.
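
For readers who want to experiment with the new madvise() commands from user space, here is a minimal Python sketch. It assumes Python 3.8 or later (for mmap.madvise()) running on Linux; the numeric values for MADV_COLD and MADV_PAGEOUT are taken from the kernel's uapi headers and defined by hand because Python's mmap module may not export those names. The helper names advise() and demo() are invented for this example.

```python
import mmap

# Values from the Linux uapi header asm-generic/mman-common.h; defined by
# hand here because the mmap module may not provide these constants.
MADV_COLD = 20
MADV_PAGEOUT = 21


def advise(region: mmap.mmap, advice: int) -> bool:
    """Apply `advice` to the whole mapping; return False if the kernel refuses."""
    try:
        region.madvise(advice)
        return True
    except OSError:
        # Kernels older than 5.4 do not know these advice values.
        return False


def demo() -> dict:
    size = 16 * mmap.PAGESIZE
    # Private anonymous memory, so the pages belong to this process alone.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        payload = b"\xa5" * size
        region[:] = payload                       # touch every page
        cold = advise(region, MADV_COLD)          # hint: deactivate the pages
        pageout = advise(region, MADV_PAGEOUT)    # hint: reclaim them now
        # Reclaimed pages are brought back transparently on the next access.
        intact = region[:] == payload
    finally:
        region.close()
    return {"cold": cold, "pageout": pageout, "intact": intact}


if __name__ == "__main__":
    print(demo())
```

Both calls are hints about reclaim rather than requests to discard data, which is why the contents are expected to read back unchanged afterward.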

Security-related

The integrity-measurement (IMA) subsystem has gained support for verifying signatures appended to files. It has not, however, gained much in the way of documentation for this feature; what is available can be found in this commit.
After years of work and controversy, the kernel lockdown patch set has been merged in the form of a Linux security module.
In a last-minute move that, seemingly, is responsible for the one-day delay in the release of 5.4-rc1, Torvalds decided to merge an entropy-collection mechanism for random-number generation based on the "jitter entropy" idea. The purpose here is to address the boot-time entropy issues that can cause a system to hang during boot in some situations. This may not be the ultimate form of the solution:

I'm not saying my patch is going to be the last word on the issue. I'm _personally_ ok with it and believe it's not crazy, and if it then makes serious people go "Eww" and send some improvements to it, then it has served its purpose.

Torvalds was clear, though, that he wants to see some sort of solution to the boot-time entropy problem in 5.4.
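
The general idea behind jitter-based entropy collection can be sketched in user space: small, unpredictable variations in how long a fixed piece of work takes are sampled with a high-resolution clock and mixed together. The Python below is only an illustration of that concept under stated assumptions; the function name jitter_seed(), the sample count, and the use of SHA-256 as the mixing function are all choices made for this example. It is not the kernel's code, makes no claim about how much entropy is actually gathered, and should not be used for cryptographic purposes.

```python
import hashlib
import time


def jitter_seed(samples: int = 4096) -> bytes:
    """Mix the timing deltas of a small busy loop into a 32-byte digest."""
    pool = hashlib.sha256()
    last = time.perf_counter_ns()
    for _ in range(samples):
        # A fixed amount of work; its duration varies slightly from run to
        # run because of caches, interrupts, and scheduling.
        sum(range(64))
        now = time.perf_counter_ns()
        # perf_counter_ns() is monotonic, so the delta is never negative.
        pool.update((now - last).to_bytes(8, "little"))
        last = now
    return pool.digest()


if __name__ == "__main__":
    print(jitter_seed().hex())
```

How much real unpredictability such timing differences contain depends heavily on the hardware, which is part of why the approach has been controversial.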

Internal kernel changes

The build system will now refuse to proceed if the gold linker is detected. There are a few problems that make gold unsuitable for kernel building; see this commit for details.
Support for kernel symbol namespaces has been added, providing a way to bring some order to the many thousands of exported symbols.
The checkpatch.pl tool will now warn about invalid commit IDs in changelogs.

The development community will now focus on stabilizing this work over the next 7-8 weeks, leading to an expected 5.4 release in the second half of November.


Posted Oct 1, 2019 9:02 UTC (Tue) by meyert (subscriber, #32097)

But who controls the ultravisor?! :-D

Posted Oct 1, 2019 9:11 UTC (Tue) by gevaerts (subscriber, #21521)

The metavisor, of course!

Posted Oct 1, 2019 11:10 UTC (Tue) by smurf (subscriber, #17840)

Ah no. It's ultravisors all the way down.

Posted Oct 1, 2019 10:48 UTC (Tue) by tkreagan (subscriber, #4548)

turtlevisor. then systemd under that.

Posted Oct 4, 2019 21:34 UTC (Fri) by mtaht (subscriber, #11087)

"turtlevisor" made me blow coffee through my nose. thx!

I've been a lonely advocate of a rethink of how we do CPU architectures for a long time, and have called for more hardware support of features essential to faster context and privilege switching, along the lines of what the Mill proposed ( https://millcomputing.com/docs/security/ ).

Posted Oct 1, 2019 12:55 UTC (Tue) by ncultra (✭ supporter ✭, #121511)

Two key pieces of the design appear problematic from the start.

The "ultravisor" inherits (receives, accepts?) a virtual machine from KVM. At that point, if KVM (and therefore Linux and QEMU) is untrusted, this is "shutting the barn door after all the animals have escaped." I don't doubt that the "ultravisor" would be able to monitor the hypercalls made by the compromised virtual machine from that point onward, encrypt and decrypt its virtual storage, etc., but to what effect, if the guest has already been compromised?

All I/O made by a "secure" virtual machine is virtual I/O. Virtio through QEMU has a history of vulnerabilities, and involves QEMU having shared mappings with the virtual machine. This is problematic. It would be more secure to pass physical device functions directly to the virtual machine and to NOT allow virtual I/O from the secure virtual machine.

It seems as though this is a Rube Goldberg-like fix for shared-processor and memory side-channel attacks, dressed up as a new feature.

Posted Oct 3, 2019 2:19 UTC (Thu) by roc (subscriber, #30627)

There's a verification step which presumably uses digital signatures or some other mechanism to ensure that the new guest is in a known good state.

Posted Oct 4, 2019 8:23 UTC (Fri) by linuxram (guest, #22157)

Correct. When the ultravisor moves the pages of the VM from normal memory to secure memory, it first checks the integrity of the content. If the check fails, it fails the VM. NOTE: normal memory is under the hypervisor's control and secure memory is under the ultravisor's control. The hypervisor cannot access secure memory.

Here are some presentations that will explain the architecture better.

https://www.youtube.com/watch?v=pKh_mPPo9X4
https://static.sched.com/hosted_files/openpowerna19/45/Op...
https://www.youtube.com/watch?v=l4jccqc14Vc
https://static.sched.com/hosted_files/kvmforum2018/57/SVM...