The conclusion of the 5.14 merge window

By Jonathan Corbet
July 12, 2021

The 5.14 merge window closed with the 5.14-rc1 release on July 11. By that time, some 12,981 non-merge changesets had been pulled into the mainline repository; nearly 8,000 of those arrived after the first LWN 5.14 merge-window summary was written. This merge window has thus seen fewer commits than its predecessor, which saw 14,231 changesets before the 5.13-rc1 release. That said, there is still a lot of interesting work that has found its way into the kernel this time around.

Some of the more significant changes pulled in the second half of the 5.14 merge window include:

Architecture-specific

The s390 architecture now supports booting kernels compressed with the Zstandard (zstd) algorithm.
The RISC-V architecture has gained support for transparent huge pages and support for the KFENCE memory-safety checker.

Core kernel

The control-group kill button patch set has been merged; this feature allows the quick killing of all members of a control group by writing to the cgroup.kill virtual control file.
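As a rough illustration of the interface, killing a group comes down to a single write. The sketch below assumes a cgroup-v2 hierarchy and that the caller passes the directory of the group to kill; the helper name cgroup_kill_all() is made up for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Kill every process in the control group whose directory is cgroup_dir
 * by writing "1" to its cgroup.kill file.
 * Returns 0 on success, -1 on failure with errno set.
 * cgroup_kill_all() is a hypothetical helper, not a kernel or libc API.
 */
int cgroup_kill_all(const char *cgroup_dir)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/cgroup.kill", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;   /* no such group, or a kernel without cgroup.kill */

    ssize_t written = write(fd, "1", 1);
    int saved = errno;
    close(fd);
    errno = saved;   /* report the write() error, not the close() one */
    return written == 1 ? 0 : -1;
}
```

A caller might use it as cgroup_kill_all("/sys/fs/cgroup/mygroup"), where the mount point and group name are placeholders.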
There are two new options for the madvise() system call:
- MADV_POPULATE_READ will fault in all pages within the indicated mapping for read access; the effect is the same as if the caller had manually looped through the range, accessing each page. No COW mappings will be broken by this operation.
- MADV_POPULATE_WRITE, instead, will fault in the pages for write access, breaking COW mappings if need be.
The purpose of these operations, in either case, is to pay the cost of faulting in a range of memory immediately, allowing the application to run without page-fault-induced delays later on. They differ from the MAP_POPULATE option to mmap() in that they can be invoked at any time rather than just when the memory is mapped. See this commit for more information.
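A minimal sketch of how an application might use the write variant follows. The fallback definitions are there for C libraries whose headers predate 5.14, and the manual loop handles kernels that reject the new advice values; the function names are invented for this example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for madvise() and MAP_ANONYMOUS under strict -std modes */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Values from the 5.14 UAPI headers, for building against older libc headers. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Prefault [addr, addr + len) for write access. Tries MADV_POPULATE_WRITE
 * first; if the kernel answers EINVAL (as pre-5.14 kernels do for unknown
 * advice), falls back to touching one byte in each page by hand.
 * Hypothetical helper. Returns 0 on success, -1 on failure.
 */
int prefault_for_write(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = addr;
    for (size_t off = 0; off < len; off += page)
        p[off] = p[off];   /* read, then write back, so existing data survives */
    return 0;
}

/* Map len bytes of private anonymous memory and prefault it; NULL on failure. */
void *alloc_prefaulted(size_t len)
{
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return NULL;
    if (prefault_for_write(addr, len) != 0) {
        munmap(addr, len);
        return NULL;
    }
    return addr;
}
```

Since the populate step is separate from mmap(), it could just as well run after other setup on the mapping (further madvise() flags, for example) has been done.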
The memfd_secret() system call has been merged. It creates a region of memory that is private to the caller; even the kernel cannot directly access it. See this commit for a bit more information.
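The sketch below shows one plausible way to call it, going through syscall() since C libraries may not provide a wrapper. The fallback syscall number is an assumption (447, as used on x86-64 and arm64), secret_alloc() is a name made up for this example, and the call can fail with ENOSYS on kernels where the feature is not built in or not enabled.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for syscall() and ftruncate() under strict -std modes */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447   /* assumption: number used on x86-64 and arm64 */
#endif

/*
 * Create a secret memory area of len bytes. On success, returns the mapping
 * and stores the backing file descriptor in *fd_out; on failure, returns
 * NULL with errno set. Hypothetical helper, not a libc API.
 */
void *secret_alloc(size_t len, int *fd_out)
{
    int fd = (int)syscall(SYS_memfd_secret, 0u);
    if (fd < 0)
        return NULL;

    void *addr = MAP_FAILED;
    if (ftruncate(fd, (off_t)len) == 0) {
        /* Secret memory has to be mapped shared. */
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }
    if (addr == MAP_FAILED) {
        int saved = errno;
        close(fd);
        errno = saved;
        return NULL;
    }

    *fd_out = fd;
    return addr;
}
```

The caller is expected to munmap() the region and close() the descriptor when done with the secret.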

Filesystems and block I/O

The ext4 filesystem has gained a new ioctl() command called EXT4_IOC_CHECKPOINT. This command forces all pending transactions out of the journal, and can also overwrite the space on the storage device used by the journal. This operation is part of an effort to prevent information leaks from filesystems. This documentation commit describes the new operation and its options.
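For the curious, a call might look like the sketch below. The ioctl number and flag values were copied by hand from fs/ext4/ext4.h as of 5.14 and should be treated as assumptions to check against your own tree; ext4_checkpoint() is a made-up helper name, and the operation is expected to require privileges.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Assumption: hand-copied from fs/ext4/ext4.h (5.14); verify before relying on them. */
#ifndef EXT4_IOC_CHECKPOINT
#define EXT4_IOC_CHECKPOINT _IOW('f', 43, uint32_t)
#endif
#ifndef EXT4_IOC_CHECKPOINT_FLAG_DISCARD
#define EXT4_IOC_CHECKPOINT_FLAG_DISCARD 0x1  /* discard the journal blocks */
#define EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT 0x2  /* overwrite the journal blocks with zeros */
#define EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN 0x4  /* validate arguments only */
#endif

/*
 * Ask the ext4 filesystem containing path to checkpoint its journal.
 * flags is zero or a combination of the EXT4_IOC_CHECKPOINT_FLAG_* values.
 * Returns 0 on success, -1 on failure with errno set.
 * Hypothetical helper for illustration.
 */
int ext4_checkpoint(const char *path, uint32_t flags)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int ret = ioctl(fd, EXT4_IOC_CHECKPOINT, &flags);
    int saved = errno;
    close(fd);
    errno = saved;   /* keep the ioctl() error, not the close() one */
    return ret == 0 ? 0 : -1;
}
```

Passing the dry-run flag first, as in ext4_checkpoint("/mnt/data", EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN) with a placeholder mount point, would be a cautious way to test whether the running kernel accepts the call at all.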
The quotactl_fd() system call has been added. This is the new form of quotactl_path() that was briefly added to 5.13 before being disabled as the result of API concerns.
The F2FS filesystem can now compress files that are mapped with mmap(). There is also a new nocompress_extension mount option that disables compression for any file whose name matches the given extension(s).

Hardware support

Clock: Qualcomm MDM9607 global clock controllers, Qualcomm SM6125 global clock controllers, Qualcomm SM8250 camera clock controllers, Renesas RZ/G2L family clock controllers, TI LMK04832 JESD204B-clock jitter cleaners, Ingenic JZ4760 clock controllers, and Huawei Hi3559A clocks.
Graphics: ITE IT66121 HDMI bridges, ChromeOS EC ANX7688 bridges, Hyper-V synthetic video devices, and TI SN65DSI83 and SN65DSI84 DSI to LVDS bridges. There is also a new "simpledrm" driver that provides a direct-rendering interface for simple framebuffer devices; there are also the inevitable 200,000+ lines of new amdgpu register definitions.
Industrial I/O: TI TMP117 digital temperature sensors, TI TSC2046 analog-to-digital converters, TAOS TSL2591 ambient light sensors, Murata SCA3300 3-axis accelerometers, Sensirion SPS30 particulate matter sensors, STMicroelectronics LSM9DS0 inertial sensors, NXP FXLS8962AF/FXLS8964AF accelerometers, and Intel quadrature encoders.
Miscellaneous: Microchip 48L640 EERAM chips, PrimeCell SMC PL351 and PL353 NAND controllers, SparkFun Qwiic joysticks, Richtek RT4831 backlight power controllers, Qualcomm PM8008 power-management ICs, Xillybus generic FPGA interfaces for USB, Qualcomm SC7280 interconnects, generic CAN transceivers, Rockchip Innosilicon MIPI CSI PHYs, Allwinner SUN6I hardware spinlocks, and MStar MSC313e watchdogs.
Pin control: Mediatek MT8365 pin controllers, Qualcomm SM6125 pin controllers, and IDT 79RC3243X GPIO controllers.
Sound: NXP/Goodix TFA989X (TFA1) amplifiers, Rockchip RK817 audio codecs, and Qualcomm WCD9380/WCD9385 codecs.
Removals: the "raw" driver, which provided unbuffered access to block devices under /dev/raw, has been removed. Applications needing this sort of access have long since moved to O_DIRECT, or at least that's the belief.

Virtualization and containers

User-mode Linux now supports PCI drivers with a new PCI-over-virtio driver.

Testing and tracing

The kunit self-test subsystem now supports running tests under QEMU; see this documentation commit for details.
There are two new tracing mechanisms in 5.14. The "osnoise" tracer tracks application delays caused by kernel activity — interrupt handling and such. The "timerlat" tracer gives detailed information about delays in timer-based wakeups. The osnoise and timerlat commits have more details and instructions on how to use these features.

The 5.14 kernel is now in the stabilization phase. Unless something highly unusual happens, the final 5.14 release will happen on August 29 or September 5. There is a lot of testing and bug-fixing to be done in the meantime.

MADV_POPULATE_* and mbind()

Posted Jul 12, 2021 22:07 UTC (Mon) by abatters (✭ supporter ✭, #6932) [Link] (5 responses)

In some of my programs I allocate memory with specific properties:

mmap(MAP_PRIVATE | MAP_ANONYMOUS)
mbind() to a specific NUMA node
set other madvise flags (MADV_HUGEPAGE, MADV_DONTDUMP, MADV_DONTFORK, etc.)
prefault in the pages manually by looping over the allocation and reading at PAGE_SIZE-intervals

A long time ago (many kernels ago), I found that prefaulting is needed because just doing a system call like read() and passing the buffer without prefaulting from userspace doesn't always obey mbind() policy. I once tried using mlock() to prefault the pages, but that ignored the mbind() policy also (again with old kernels).

So do these new MADV_POPULATE_* obey mbind() policy?

MADV_POPULATE_* and mbind()

Posted Jul 13, 2021 8:01 UTC (Tue) by david.hildenbrand (subscriber, #108299) [Link] (4 responses)

> "by looping over the allocation and reading at PAGE_SIZE-intervals"

Are you sure that you are *reading* and not writing? On anonymous memory, reading will simply populate the shared zeropage, so I'd be surprised if it (no populated page vs. populated shared zeropage) makes a real difference when later reading from that mapping (read() ...), or even when writing to it (write() ...) in your example.

mlock(), MAP_POPULATE and the new MADV_POPULATE_READ and MADV_POPULATE_WRITE options nowadays all end up calling handle_mm_fault() -- the very basic fault handler also called on page faults on the faulting CPU. So I'd be surprised if they behave differently -- but I'll double check.

Note that there are subtle differences when it comes to shared mappings: mlock() and MAP_POPULATE won't trigger COW on shared mappings. But for your example, mmap(MAP_PRIVATE | MAP_ANONYMOUS), the mbind() documentation is quite clear: "pages will be allocated only according to the specified policy when the application writes (stores) to the page. For anonymous regions, an initial read access will use a shared page in the kernel containing all zeros." And I'd assume that holds for any allocations, also when triggering writes from other CPUs, e.g., as part of a syscall.

MADV_POPULATE_* and mbind()

Posted Jul 13, 2021 13:17 UTC (Tue) by abatters (✭ supporter ✭, #6932) [Link] (2 responses)

I just double-checked, and you are correct, my code does write to the memory, and the comment even says that it is to break the COW mapping so that the memory is actually allocated, so my previous comment was in error.

MADV_POPULATE_* and mbind()

Posted Jul 13, 2021 14:16 UTC (Tue) by david.hildenbrand (subscriber, #108299) [Link] (1 responses)

Makes sense! QEMU similarly reads+writes one byte of each page when told to preallocate guest memory; the read+write is in place to trigger COW, but to not overwrite existing data, for example, when some piece of guest memory corresponds to a virtual NVDIMM.

In the meantime, I verified that MADV_POPULATE_* and mbind() work as expected.

MADV_POPULATE_* and mbind()

Posted Jul 13, 2021 14:41 UTC (Tue) by abatters (✭ supporter ✭, #6932) [Link]

Thanks for taking the time to look into this!

MADV_POPULATE_* and mbind()

Posted Jul 16, 2022 14:52 UTC (Sat) by rockeet (guest, #159726) [Link]

Is there any difference between MADV_POPULATE_* and mlock?

The conclusion of the 5.14 merge window

Posted Jul 12, 2021 23:01 UTC (Mon) by JohnVonNeumann (guest, #131609) [Link] (3 responses)

Taken from: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

> Nowadays a new system call cost is negligible while it is way
> simpler for userspace to deal with a clear-cut system calls than with a
> multiplexer or an overloaded syscall.

I am a Kernel noob, was just wondering what/if there are downsides to increasing the number of syscalls? Is there a worry about far too much fragmentation amongst syscalls? I guess if I was to make a bad comparison, I'm aware that the x86 instruction set is massive, and people like Chris Domas have done research and found hidden instructions due to the size of the instruction set. Again, I want to reiterate that I know this is a bad example, but I'm just trying to illustrate a point.

The conclusion of the 5.14 merge window

Posted Jul 13, 2021 4:51 UTC (Tue) by Paf (subscriber, #91811) [Link]

I think if new functionality is desired and can be clearly delineated, then it’s no worse than other systems growing larger. It has costs. But nothing enormous.

The conclusion of the 5.14 merge window

Posted Jul 13, 2021 11:36 UTC (Tue) by Sesse (subscriber, #53779) [Link]

The cost is primarily technical, not really about performance. There might be some small cache effects if you call way too much different code, but it's unlikely to be a big deal.

The conclusion of the 5.14 merge window

Posted Jul 15, 2021 8:00 UTC (Thu) by maxfragg (subscriber, #122266) [Link]

The cost would be higher in an OS less focused on portability than Linux.
Linux tends to use its own syscall dispatching rather than the older syscall mechanisms in which a hardware instruction carries an immediate syscall number; those tend to be quite limited.

For example, x86-32 used int 80h for all syscalls, while some non-portable systems might want to avoid dispatching inside the int 80h handler and instead spread syscalls over the interrupts; then, if you run out of interrupt numbers, you have a cost increase. Linux uses dispatching anyway, so there is no big cost to having a thousand syscalls, besides someone having to maintain them all and the desire to basically never break even a single one.