The conclusion of the 5.14 merge window
Some of the more significant changes pulled in the second half of the 5.14 merge window include:
Architecture-specific
- The s390 architecture now supports booting kernels compressed with the Zstandard (zstd) algorithm.
- The RISC-V architecture has gained support for transparent huge pages and support for the KFENCE memory-safety checker.
Core kernel
- The control-group kill button patch set has been merged; this feature allows the quick killing of all members of a control group by writing to the cgroup.kill virtual control file.
- There are two new options for the madvise() system call:
- MADV_POPULATE_READ will fault in all pages within the indicated mapping for read access; the effect is the same as if the caller had manually looped through the range, accessing each page. No COW mappings will be broken by this operation.
- MADV_POPULATE_WRITE, instead, will fault in the pages for write access, breaking COW mappings if need be.
The purpose of these operations, in either case, is to pay the cost of faulting in a range of memory immediately, allowing the application to run without page-fault-induced delays later on. They differ from the MAP_POPULATE option to mmap() in that they can be invoked at any time rather than just when the memory is mapped. See this commit for more information.
- The memfd_secret() system call has been merged. It creates a region of memory that is private to the caller; even the kernel cannot directly access it. See this commit for a bit more information.
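As a rough illustration of the madvise() populate operations described above, the following Python sketch prefaults a private anonymous mapping for writing. It is a sketch under stated assumptions, not a reference implementation: the helper name prefault_for_write() is hypothetical, the numeric values 22 and 23 are taken from the kernel's generic mman-common.h header and are used only if Python's mmap module does not export the constants, and the manual fallback overwrites one byte per page, so it is only appropriate for a fresh, zero-filled mapping.

```python
import mmap

# Assumption: values from the kernel's asm-generic mman-common.h (5.14+).
# Python's mmap module may not export these names, hence the fallback.
MADV_POPULATE_READ = getattr(mmap, "MADV_POPULATE_READ", 22)
MADV_POPULATE_WRITE = getattr(mmap, "MADV_POPULATE_WRITE", 23)


def prefault_for_write(mm: mmap.mmap) -> str:
    """Fault in every page of a fresh, zero-filled mapping for write access.

    Returns "madvise" if the kernel accepted MADV_POPULATE_WRITE, or
    "manual" if the pages had to be touched one at a time instead.
    """
    length = len(mm)
    try:
        # One call pays the page-fault cost for the whole range up front.
        mm.madvise(MADV_POPULATE_WRITE, 0, length)
        return "madvise"
    except OSError:
        # Kernels before 5.14 reject the request; fall back to the manual
        # loop, writing a zero byte to each page (safe only because the
        # mapping is known to be new and zero-filled).
        for offset in range(0, length, mmap.PAGESIZE):
            mm[offset] = 0
        return "manual"
```

A caller would create the mapping with something like mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS) and invoke the helper whenever it wants the pages resident, which is the flexibility that MAP_POPULATE lacks.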
Filesystems and block I/O
- The ext4 filesystem has gained a new ioctl() command called EXT4_IOC_CHECKPOINT. This command forces all pending transactions out of the journal, and can also overwrite the space on the storage device used by the journal. This operation is part of an effort to prevent information leaks from filesystems. This documentation commit describes the new operation and its options.
- The quotactl_fd() system call has been added. This is the new form of quotactl_path() that was briefly added to 5.13 before being disabled as the result of API concerns.
- The F2FS filesystem can now compress files that are mapped with mmap(). There is also a new nocompress_extension mount option that disables compression for any file whose name matches the given extension(s).
Hardware support
- Clock: Qualcomm MDM9607 global clock controllers, Qualcomm SM6125 global clock controllers, Qualcomm SM8250 camera clock controllers, Renesas RZ/G2L family clock controllers, TI LMK04832 JESD204B-clock jitter cleaners, Ingenic JZ4760 clock controllers, and Huawei Hi3559A clocks.
- Graphics: ITE IT66121 HDMI bridges, ChromeOS EC ANX7688 bridges, Hyper-V synthetic video devices, and TI SN65DSI83 and SN65DSI84 DSI to LVDS bridges. There is also a new "simpledrm" driver that provides a direct-rendering interface for simple framebuffer devices; there are also the inevitable 200,000+ lines of new amdgpu register definitions.
- Industrial I/O: TI TMP117 digital temperature sensors, TI TSC2046 analog-to-digital converters, TAOS TSL2591 ambient light sensors, Murata SCA3300 3-axis accelerometers, Sensirion SPS30 particulate matter sensors, STMicroelectronics LSM9DS0 inertial sensors, NXP FXLS8962AF/FXLS8964AF accelerometers, and Intel quadrature encoders.
- Miscellaneous: Microchip 48L640 EERAM chips, PrimeCell SMC PL351 and PL353 NAND controllers, SparkFun Qwiic joysticks, Richtek RT4831 backlight power controllers, Qualcomm PM8008 power-management ICs, Xillybus generic FPGA interfaces for USB, Qualcomm SC7280 interconnects, generic CAN transceivers, Rockchip Innosilicon MIPI CSI PHYs, Allwinner SUN6I hardware spinlocks, and MStar MSC313e watchdogs.
- Pin control: Mediatek MT8365 pin controllers, Qualcomm SM6125 pin controllers, and IDT 79RC3243X GPIO controllers.
- Sound: NXP/Goodix TFA989X (TFA1) amplifiers, Rockchip RK817 audio codecs, and Qualcomm WCD9380/WCD9385 codecs.
- Removals: the "raw" driver, which provided unbuffered access to block devices under /dev/raw, has been removed. Applications needing this sort of access have long since moved to O_DIRECT, or at least that's the belief.
Virtualization and containers
- User-mode Linux now supports PCI drivers with a new PCI-over-virtio driver.
Testing and tracing
- The kunit self-test subsystem now supports running tests under QEMU; see this documentation commit for details.
- There are two new tracing mechanisms in 5.14. The "osnoise" tracer tracks application delays caused by kernel activity — interrupt handling and such. The "timerlat" tracer gives detailed information about delays in timer-based wakeups. The osnoise and timerlat commits have more details and instructions on how to use these features.
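For readers who want to try the new tracers, they are selected like any other ftrace tracer, by writing the tracer's name into the current_tracer file in tracefs. The sketch below assumes that tracefs is mounted at /sys/kernel/tracing and that the caller has sufficient privileges; the helper name select_tracer() is hypothetical, and the referenced commits remain the authoritative source for the tracer-specific options.

```python
from pathlib import Path

# Assumption: the usual tracefs mount point; some systems expose it under
# /sys/kernel/debug/tracing instead.
TRACEFS = Path("/sys/kernel/tracing")


def select_tracer(name: str, tracefs: Path = TRACEFS) -> bool:
    """Enable the named tracer if the running kernel offers it.

    Returns False (and changes nothing) when the tracer is not listed in
    available_tracers, for example on a kernel older than 5.14 or one
    built without the osnoise/timerlat tracers.
    """
    available = (tracefs / "available_tracers").read_text().split()
    if name not in available:
        return False
    (tracefs / "current_tracer").write_text(name)
    return True
```

Calling select_tracer("osnoise") or select_tracer("timerlat") as root would then start the corresponding tracer, with the results readable from the trace file in the same directory.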
The 5.14 kernel is now in the stabilization phase. Unless something highly
unusual happens, the final 5.14 release will happen on August 29 or
September 5. There is a lot of testing and bug-fixing to be done in
the meantime.
Index entries for this article | |
---|---|
Kernel | Releases/5.14 |
Posted Jul 12, 2021 22:07 UTC (Mon)
by abatters (✭ supporter ✭, #6932)
[Link] (5 responses)
mmap(MAP_PRIVATE | MAP_ANONYMOUS)
A long time ago (many kernels ago), I found that prefaulting is needed because just doing a system call like read() and passing the buffer without prefaulting from userspace doesn't always obey mbind() policy. I once tried using mlock() to prefault the pages, but that ignored the mbind() policy also (again with old kernels).
So do these new MADV_POPULATE_* obey mbind() policy?
Posted Jul 13, 2021 8:01 UTC (Tue)
by david.hildenbrand (subscriber, #108299)
[Link] (4 responses)
Are you sure that you are *reading* and not writing? On anonymous memory, reading will simply populate the shared zeropage, so I'd be surprised if it (no populated page vs. populated shared zeropage) makes a real difference when later reading from that mapping (read() ...), or even when writing to it (write() ...) in your example.
mlock(), MAP_POPULATE and the new MADV_POPULATE_READ and MADV_POPULATE_WRITE options nowadays all end up calling handle_mm_fault() -- the very basic fault handler also called on page faults on the faulting CPU. So I'd be surprised if they behave differently, but I'll double check.
Note that there are subtle differences when it comes to shared mappings: mlock() and MAP_POPULATE won't trigger COW on shared mappings. But for your example, mmap(MAP_PRIVATE | MAP_ANONYMOUS), the mbind() documentation is quite clear: "pages will be allocated only according to the specified policy when the application writes (stores) to the page. For anonymous regions, an initial read access will use a shared page in the kernel containing all zeros." And I'd assume that holds for any allocations, also when triggering writes from other CPUs, e.g., as part of a syscall.
Posted Jul 13, 2021 13:17 UTC (Tue)
by abatters (✭ supporter ✭, #6932)
[Link] (2 responses)
Posted Jul 13, 2021 14:16 UTC (Tue)
by david.hildenbrand (subscriber, #108299)
[Link] (1 responses)
In the meantime, I verified that MADV_POPULATE_* and mbind() work as expected.
Posted Jul 13, 2021 14:41 UTC (Tue)
by abatters (✭ supporter ✭, #6932)
[Link]
Posted Jul 16, 2022 14:52 UTC (Sat)
by rockeet (guest, #159726)
[Link]
Posted Jul 12, 2021 23:01 UTC (Mon)
by JohnVonNeumann (guest, #131609)
[Link] (3 responses)
> Nowadays a new system call cost is negligible while it is way
I am a Kernel noob, was just wondering what/if there are downsides to increasing the number of syscalls? Is there a worry about far too much fragmentation amongst syscalls? I guess if I was to make a bad comparison, I'm aware that the x86 instruction set is massive, and people like Chris Domas have done research and found hidden instructions due to the size of the instruction set. Again, I want to reiterate that I know this is a bad example, but I'm just trying to illustrate a point.
Posted Jul 13, 2021 4:51 UTC (Tue)
by Paf (subscriber, #91811)
[Link]
Posted Jul 13, 2021 11:36 UTC (Tue)
by Sesse (subscriber, #53779)
[Link]
Posted Jul 15, 2021 8:00 UTC (Thu)
by maxfragg (subscriber, #122266)
[Link]
For example, x86-32 used int 80h for all syscalls, while some non-portable systems might want to avoid dispatching inside the int 80h handler and instead spread syscalls over the interrupts; then, if you run out of interrupt numbers, you have a cost increase. Linux uses dispatching anyway, so there is no big cost to having a thousand syscalls, besides someone having to maintain them all and the desire to basically never break even a single one.
MADV_POPULATE_* and mbind()
mbind() to a specific NUMA node
set other madvise flags (MADV_HUGEPAGE, MADV_DONTDUMP, MADV_DONTFORK, etc.)
prefault in the pages manually by looping over the allocation and reading at PAGE_SIZE-intervals
> simpler for userspace to deal with a clear-cut system calls than with a
> multiplexer or an overloaded syscall.
Linux tends to use its own syscall dispatching rather than the old syscall mechanisms where you use a hardware instruction with an immediate syscall number, which tend to be quite limited.