The first half of the 6.5 merge window
Architecture-specific
- X86 systems can now parallelize much of the process of bringing up all of the CPUs, reducing the time to get all processors online by as much as a factor of ten.
- Intel's "Topology Aware Register and PM Capsule Interface" (abbreviated "TPMI") is now supported. This interface provides a better way of managing power-management features.
- The arm64 permission-indirection extension is now supported. There is no new functionality resulting from this support now, but it is needed for some upcoming features.
Core kernel
- The io_uring subsystem has gained the ability to store the rings and submission queue in user-space memory, rather than having the kernel allocate that memory. This allows user space to allocate the needed memory as huge pages, hopefully improving performance. This changelog has a little more information.
- The kernel's Rust support has been upgraded to the Rust 1.68.2 release, the first such upgrade since that support was merged. Other than that, the changes to Rust support were relatively minor this time around; this merge message has the details.
- The kernel has gained support for unaccepted memory — the protocol by which secure guest systems accept memory allocated by the host. The merged code includes the (somewhat) controversial protocol to automatically accept all provided memory in the firmware when running a guest kernel without support for memory acceptance.
- The BPF subsystem has gained the ability to attach filter functions to kfuncs; the filter can limit the contexts from which the kfunc can be invoked. The initial use is to restrict callers of bpf_sock_destroy() to programs of the BPF_TRACE_ITER type.
- Pinning of BPF objects can now be done using O_PATH file descriptors as an alternative to providing the path name for the target directory.
Filesystems and block I/O
- It is now possible to mount a filesystem underneath an existing mount on the same mount point; this feature is useful for the provision of seamless updates within containers. See this article, this article, and the merge message for details.
- The new cachestat() system call can query the page-cache state of files and directories, allowing user space to determine which of its file pages are currently in RAM. See this article for details and this commit for a man page.
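As a rough illustration of how user space might try cachestat() before a libc wrapper is available, the sketch below issues the raw system call through ctypes. The syscall number (451) and both structure layouts are assumptions based on the 6.5 UAPI headers and should be checked against the headers for the target architecture; `page_cache_stats` is a hypothetical helper name. It returns None wherever the call is missing or refused.

```python
import ctypes
import sys
import tempfile

SYS_CACHESTAT = 451  # assumption: number used on x86-64 and arm64 in 6.5


class CachestatRange(ctypes.Structure):
    # Byte range to query; a length of 0 means "through the end of the file".
    _fields_ = [("off", ctypes.c_uint64), ("len", ctypes.c_uint64)]


class Cachestat(ctypes.Structure):
    # Page counts reported by the kernel (layout assumed from the UAPI header).
    _fields_ = [
        ("nr_cache", ctypes.c_uint64),
        ("nr_dirty", ctypes.c_uint64),
        ("nr_writeback", ctypes.c_uint64),
        ("nr_evicted", ctypes.c_uint64),
        ("nr_recently_evicted", ctypes.c_uint64),
    ]


def page_cache_stats(fd):
    """Hypothetical helper: dict of page counts for fd, or None if unavailable."""
    if not sys.platform.startswith("linux"):
        return None
    try:
        syscall = ctypes.CDLL(None, use_errno=True).syscall
    except (OSError, AttributeError):
        return None
    syscall.restype = ctypes.c_long
    rng = CachestatRange(0, 0)
    out = Cachestat()
    ret = syscall(
        ctypes.c_long(SYS_CACHESTAT),
        ctypes.c_uint(fd),
        ctypes.byref(rng),
        ctypes.byref(out),
        ctypes.c_uint(0),  # flags: none are defined in this sketch
    )
    if ret != 0:
        return None  # ENOSYS on older kernels, or blocked by a sandbox
    return {name: getattr(out, name) for name, _ in Cachestat._fields_}


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 8192)
        f.flush()
        print(page_cache_stats(f.fileno()))
```

On a kernel older than 6.5 the helper simply reports None, so a caller can fall back to mincore() or to no residency information at all.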
Hardware support
- Miscellaneous: Renesas RZ/V2M clocked serial interfaces.
- Networking: Fintek F81604 USB to 2CAN interfaces, Microchip LAN865x Rev.B0 10BASE-T1S Internal PHYs, Realtek RTL8192FU interfaces, Realtek 8723DS SDIO wireless network adapters, Realtek 8851BE PCI wireless network adapters, and MediaTek SoC Ethernet PHYs.
- Regulator: TI TPS6287x power regulators, TI TPS6594 power-management chips, Rockchip RK806 power-management chips, and Renesas RAA215300 power-management ICs.
Miscellaneous
- The nolibc library has gained stack protector support, a number of architecture-specific improvements, and more.
Networking
- The passing of process credentials, as done with the SCM_CREDENTIALS control message, has been enhanced with a new SCM_PIDFD type. As might be expected from the name, this message passes a pidfd rather than a process ID. There is also a new SO_PEERPIDFD option to getsockopt() that obtains the pidfd of the peer process.
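A minimal sketch of the SO_PEERPIDFD side of this change follows. It assumes the option value 77 from the generic Linux socket header (some architectures use different numbers, and Python's socket module may not export the name), and `peer_pidfd` is a hypothetical helper. On kernels without the feature it returns None rather than failing.

```python
import os
import socket
import struct
import sys

# Assumption: value from the generic Linux socket header for 6.5.
SO_PEERPIDFD = getattr(socket, "SO_PEERPIDFD", 77)


def peer_pidfd(sock):
    """Hypothetical helper: pidfd of the peer of a connected Unix socket, or None."""
    if not sys.platform.startswith("linux"):
        return None
    size = struct.calcsize("i")
    try:
        raw = sock.getsockopt(socket.SOL_SOCKET, SO_PEERPIDFD, size)
    except OSError:
        return None  # older kernel, or option not supported for this socket
    if len(raw) != size:
        return None
    (fd,) = struct.unpack("i", raw)
    return fd


if __name__ == "__main__":
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        fd = peer_pidfd(a)
        if fd is None:
            print("SO_PEERPIDFD is not available here")
        else:
            # The pidfd refers to the peer process itself, so it cannot be
            # confused with a recycled PID the way a numeric ID can.
            print("peer pidfd:", fd)
            os.close(fd)
    finally:
        a.close()
        b.close()
```

The caller owns the returned descriptor and must close it, which is the main practical difference from reading a plain process ID with SO_PEERCRED.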
Security-related
- The "secretmem" facility, in the form of the memfd_secret() system call, is now enabled by default. This change was made after some research determined that secretmem use does not hurt performance as had been thought.
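For readers who have not used memfd_secret(), here is a small sketch of obtaining a secret-memory descriptor from user space. The syscall number (447) is an assumption that holds for x86-64 and the generic syscall table as far as I know, and `secret_fd` is a hypothetical helper; it returns None when the facility is absent or disabled.

```python
import ctypes
import os
import sys

SYS_MEMFD_SECRET = 447  # assumption: x86-64 / generic syscall number


def secret_fd():
    """Hypothetical helper: a memfd_secret() descriptor, or None if unavailable."""
    if not sys.platform.startswith("linux"):
        return None
    try:
        syscall = ctypes.CDLL(None, use_errno=True).syscall
    except (OSError, AttributeError):
        return None
    syscall.restype = ctypes.c_long
    fd = syscall(ctypes.c_long(SYS_MEMFD_SECRET), ctypes.c_uint(0))
    return fd if fd >= 0 else None


if __name__ == "__main__":
    fd = secret_fd()
    if fd is None:
        print("memfd_secret() is not available here")
    else:
        # A real user would ftruncate() the descriptor and mmap() it; pages
        # mapped this way are removed from the kernel's direct map.
        print("secret memory fd:", fd)
        os.close(fd)
```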
Internal kernel changes
- The workqueue subsystem will now automatically detect CPU-intensive work items (defined as running for at least 10ms by default) and mark them. This will prevent such items from blocking the execution of other work items. There is a new debugging configuration option to enable the reporting of CPU-intensive work items detected in this way.
- The kernel is now built with the -fstrict-flex-arrays=3 compiler option, adding more warnings around the use of flexible arrays. See this article for more details on this work.
- The new attribute macro __counted_by() can be used to document which field in a structure contains the number of elements stored in a flexible array (in the same structure). The documentation is useful in its own right, and the attribute can eventually be used for bounds checks as well.
The 6.5 merge window can be expected to remain open until July 9.
LWN will be back shortly after that with a summary of the changes pulled in the second half; stay tuned.
Index entries for this article | |
---|---|
Kernel | Releases/6.5 |
Posted Jun 30, 2023 15:44 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (26 responses)
To add some more details at the risk of blowing my own trumpet, this is a collab between Canonical and Microsoft to enable secure and race-free process tracking in the auth stack, more precisely between systemd, D-Bus and policykit. Together with the kernel side, this is the userspace side which will hopefully follow once the new uapi is released:
https://gitlab.freedesktop.org/dbus/dbus/-/merge_requests...
Posted Jun 30, 2023 15:50 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jun 30, 2023 21:32 UTC (Fri)
by flussence (guest, #85566)
[Link] (22 responses)
Posted Jul 1, 2023 2:08 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (18 responses)
Posted Jul 1, 2023 10:22 UTC (Sat)
by willy (subscriber, #9762)
[Link] (9 responses)
Posted Jul 1, 2023 18:19 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (7 responses)
For optimizations, a cache of FDs for each thread might also help.
Posted Jul 1, 2023 18:35 UTC (Sat)
by willy (subscriber, #9762)
[Link] (6 responses)
Posted Jul 2, 2023 14:50 UTC (Sun)
by Paf (subscriber, #91811)
[Link]
Posted Jul 3, 2023 7:23 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
For example, each thread can get FDs in bulk in sequential blocks of 2^16. The FDs that are given to userspace can be encrypted by RC5 with 32-bit or 64-bit block size to prevent users from guessing the sequential position within the block. This way, allocation can be done locally without any locks.
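A user-space simulation of that idea, purely as a sketch: each thread takes a block of 2^16 sequential indexes and hands them out without locking, and the value shown to the caller is the index passed through a keyed 32-bit permutation. A four-round Feistel network built on BLAKE2s stands in for RC5 here, and the key, function names, and data layout are all hypothetical; none of this is kernel code.

```python
import hashlib
import itertools
import threading

BLOCK_BITS = 16
BLOCK_SIZE = 1 << BLOCK_BITS

_next_block = itertools.count()   # shared; touched once per 65536 allocations
_block_lock = threading.Lock()
_local = threading.local()
KEY = b"per-process secret"       # hypothetical; would be random in practice


def _round(half, i):
    # Keyed round function mapping 16 bits to 16 bits.
    data = KEY + bytes([i]) + half.to_bytes(2, "big")
    return int.from_bytes(hashlib.blake2s(data, digest_size=2).digest(), "big")


def scramble(n):
    """Keyed 32-bit permutation (4-round Feistel), standing in for RC5."""
    left, right = n >> 16, n & 0xFFFF
    for i in range(4):
        left, right = right, left ^ _round(right, i)
    return (left << 16) | right


def unscramble(n):
    """Inverse of scramble(): recover the sequential index from a handle."""
    left, right = n >> 16, n & 0xFFFF
    for i in reversed(range(4)):
        left, right = right ^ _round(left, i), left
    return (left << 16) | right


def alloc_fd():
    """Return (sequential index, user-visible handle) for the calling thread."""
    state = getattr(_local, "state", None)
    if state is None or state[1] == BLOCK_SIZE:
        with _block_lock:          # slow path: claim a fresh block
            base = next(_next_block) << BLOCK_BITS
        state = [base, 0]
        _local.state = state
    index = state[0] + state[1]
    state[1] += 1                  # fast path: thread-local, no lock taken
    return index, scramble(index)


if __name__ == "__main__":
    for _ in range(3):
        index, handle = alloc_fd()
        print(index, hex(handle), unscramble(handle) == index)
```

Because the permutation is a bijection, two distinct indexes never collide as handles, but the handles are scattered across the whole 32-bit space rather than being small integers.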
Posted Jul 3, 2023 12:04 UTC (Mon)
by kleptog (subscriber, #1183)
[Link] (3 responses)
This will fail spectacularly on any program using select(2). I guess you could make it opt-in, but whether you could ensure select(2) not used in any of the libraries you depend on...
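The select(2) limitation can be seen from Python, whose select.select() refuses any descriptor number that does not fit in an fd_set. This is only a sketch of the failure mode on Linux, where FD_SETSIZE is normally 1024; `selectable` is a hypothetical helper name.

```python
import select
import socket


def selectable(fd):
    """True if select() can represent this descriptor number at all."""
    try:
        select.select([fd], [], [], 0)
    except ValueError:
        return False   # number is negative or >= FD_SETSIZE
    except OSError:
        return True    # representable, but not an open descriptor
    return True


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(a.fileno(), selectable(a.fileno()))   # small, densely allocated fd
    print(1 << 20, selectable(1 << 20))         # a sparse, scrambled-style value
    a.close()
    b.close()
```

Sparse or scrambled descriptor values would almost always land in the second case, which is why poll() or epoll would be the only usable interfaces for a process that opts in.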
Posted Jul 3, 2023 13:59 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
I think it'll be the classic "roll it out slowly, and fix the breakage".
Cheers,
Posted Jul 3, 2023 15:44 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
And yes, it has to be an opt-in via prctl() and/or via ELF flags.
Posted Jul 4, 2023 13:10 UTC (Tue)
by Bigos (subscriber, #96807)
[Link]
Whether it would be efficient is a different question.
Posted Jul 2, 2023 23:03 UTC (Sun)
by njs (guest, #40338)
[Link]
Posted Jul 4, 2023 9:15 UTC (Tue)
by walters (subscriber, #7396)
[Link] (7 responses)
Since
Talking about memfd seems illustrative; today there's no io_uring op to allocate a memfd in the fixed descriptor table, but there could be. And to complete the picture here we need the io_uring spawn https://lwn.net/Articles/908268/ flow to allow passing the memfd to a child process. And io_uring support for setsockopt() for this article.
And we'd need io_uring equivalent of the pidfd APIs.
Posted Jul 5, 2023 3:19 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
io_uring is a good foot-in-the-door. So at least we now have some infrastructure for handle-based management. A prctl() option would automatically extend it to all syscalls, even the ones that are not yet supported natively in io_uring.
I like io_uring, it is becoming a better syscall interface layer. But right now, it requires a pretty heavy commitment. And it's also hard to compose across the application/library borders.
Posted Jul 6, 2023 7:42 UTC (Thu)
by walters (subscriber, #7396)
[Link] (1 responses)
I had to look this up; but I learned something and it's a perfect reply, thanks =)
> And [io_uring is] also hard to compose across the application/library borders.
Yeah, this is a good point.
> A prctl() option would automatically extend it to all syscalls, even the ones that are not yet supported natively in io_uring.
Yeah, ultimately though I think something this involved would require a serious champion; but what might be slightly more of a win than LWN threads is to draft up something like a github gist or a draft PR that shows what you think of the desired interface, then you can just link to it and maybe others can contribute.
To be clear I agree with you; as the file descriptor has evolved to be "kernel API object reference" its historical semantics leave much to be desired. Related to this and your comments about encrypting the fds; I was looking at Cap'n Proto recently and its "capabilities" are pretty nice; https://capnproto.org/rpc.html#distributed-objects It also has somewhat io_uring like semantics in allowing chained operations on those capabilities asynchronously.
Posted Jul 6, 2023 23:01 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Well, it worked for Cato eventually :)
> Yeah, ultimately though I think something this involved would require a serious champion
I thought about it, but I'm not a serious kernel developer. And this would require some major surgery in the cornerstone parts of Linux. This is not something to do lightly.
The change to discontinuous FDs is not a big deal, we'd just need to modify alloc_fd in file.c to not care about continuity. But then things get trickier and more complicated.
First, FDs won't be stored in a dense array anymore, so they'll have to be stored in a hashtable or some kind of a tree. Not a big deal, but this by itself will likely complicate the code and it won't provide any benefits by itself.
And we want the change to actually have a positive outcome in at least some benchmarks, and that's where complications start to pile up. A whole new infrastructure to allocate FDs will be needed. Ideally, threads should get their own per-thread FD pools to avoid global locking on FD allocation and deallocation. But then you get issues like allocating FDs in one thread and closing them from a different thread, so there'll be a need for some kind of a delayed close queue. In short, we're now talking about replicating what a modern memory allocator does, but for file descriptors.
I believe that it will absolutely result in faster code eventually, but not before a huge amount of work. I can sponsor some of it, but I don't have nearly enough kernel coding experience to pull it off.
Posted Jul 6, 2023 10:55 UTC (Thu)
by anton (subscriber, #25547)
[Link] (3 responses)
Posted Jul 6, 2023 22:48 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Jul 7, 2023 8:36 UTC (Fri)
by anselm (subscriber, #2796)
[Link]
It's “delenda”, not “delanda”, but otherwise you're correct. “Carthago delenda est” just means “Carthage must be destroyed” where the longer quote means “In addition, it is my opinion that Carthage must be destroyed”. The story goes that Cato used to finish every speech in the Senate with it, even if the speech was on a completely unrelated topic.
Posted Aug 28, 2023 4:17 UTC (Mon)
by gutschke (subscriber, #27910)
[Link]
In fact, it's often the canonical example that students learn when the grammatical structure of an ACI (accusativus cum infinitivo) is first introduced.
Posted Jul 1, 2023 13:10 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
Posted Jul 1, 2023 19:54 UTC (Sat)
by josh (subscriber, #17465)
[Link]
Posted Jul 2, 2023 7:28 UTC (Sun)
by jengelh (subscriber, #33263)
[Link]
Posted Jul 1, 2023 6:15 UTC (Sat)
by walters (subscriber, #7396)
[Link]
Posted Jul 2, 2023 7:00 UTC (Sun)
by brauner (subscriber, #109349)
[Link]
Technically this may be a collab but it bothers me a little since our idea's origin has zero to do with any companies. SCM_PIDFD was conceived of years ago but I've never had time to work on it. So a year or more ago we put it onto our public:
https://github.com/bus1/dbus-broker/pull/312
https://gitlab.freedesktop.org/polkit/polkit/-/merge_requ...
Wol
https://lwn.net/Articles/863483/
the io_uring work has landed, and it seems to me to be the most practical way forward - more work, but more benefit.
Carthago delanda est.
Carthago delanda est.
Spelling flames are lame, but still: Ceterum censeo, Carthaginem esse delendam.
It will be interesting to see where we will be in 20 years.
https://github.com/uapi-group/kernel-features (see SCM_PIDFD entry) (sometimes questionable) ideas list and voila, Alex showed up eager to implement it with our help which we're super grateful for. And Luca did the excellent userspace part of the work. Otherwise the uptake would probably take a lot longer.