The first half of the 6.5 merge window

By Jonathan Corbet
June 30, 2023

The first days of the 6.5 merge window have been a bit calmer than usual, with "only" 4,000 non-merge changesets having been pulled into the mainline repository. Those changesets include a fair amount of significant work, though. Read on for LWN's summary of the first set of changes merged for the next major kernel release.

Architecture-specific

X86 systems can now parallelize much of the process of bringing up all of the CPUs, reducing the time to get all processors online by as much as a factor of ten.
Intel's "Topology Aware Register and PM Capsule Interface" (abbreviated "TPMI") is now supported. This is an interface that provides a better way of managing power-management features.
The arm64 permission-indirection extension is now supported. There is no new functionality resulting from this support now, but it is needed for some upcoming features.

Core kernel

The io_uring subsystem has gained the ability to store the rings and submission queue in user-space memory, rather than having the kernel allocate that memory. This allows user space to allocate the needed memory as huge pages, hopefully improving performance. This changelog has a little more information.
The kernel's Rust support has been upgraded to the Rust 1.68.2 release, the first such upgrade since that support was merged. Other than that, the changes to Rust support were relatively minor this time around; this merge message has the details.
The kernel has gained support for unaccepted memory — the protocol by which secure guest systems accept memory allocated by the host. The merged code includes the (somewhat) controversial protocol to automatically accept all provided memory in the firmware when running a guest kernel without support for memory acceptance.
The BPF subsystem has gained the ability to attach filter functions to kfuncs; the filter can limit the contexts from which the kfunc can be invoked. The initial use is to restrict callers of bpf_sock_destroy() to programs of the BPF_TRACE_ITER type.
Pinning of BPF objects can now be done using O_PATH file descriptors as an alternative to providing the path name for the target directory.

Filesystems and block I/O

It is now possible to mount a filesystem underneath an existing mount on the same mount point; this feature is useful for the provision of seamless updates within containers. See this article, this article, and the merge message for details.
The new cachestat() system call can query the page-cache state of files and directories, allowing user space to determine which of its file pages are currently in RAM. See this article for details and this commit for a man page.
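As a rough illustration of how user space might call cachestat() before the C library grows a wrapper, here is a minimal sketch using ctypes. The syscall number (451), the restriction to x86-64 and arm64, and the structure layouts are assumptions based on the kernel's UAPI headers as merged; the helper name page_cache_stats() is hypothetical.

```python
import ctypes
import errno
import os
import platform
import sys
import tempfile

# Assumption: 451 is the cachestat() syscall number on x86-64 and arm64.
SYS_CACHESTAT = 451


class CachestatRange(ctypes.Structure):
    # Byte range to query; a length of 0 means "to the end of the file".
    _fields_ = [("off", ctypes.c_uint64), ("len", ctypes.c_uint64)]


class Cachestat(ctypes.Structure):
    # Page counts reported by the kernel for the requested range.
    _fields_ = [
        ("nr_cache", ctypes.c_uint64),
        ("nr_dirty", ctypes.c_uint64),
        ("nr_writeback", ctypes.c_uint64),
        ("nr_evicted", ctypes.c_uint64),
        ("nr_recently_evicted", ctypes.c_uint64),
    ]


def page_cache_stats(fd, offset=0, length=0):
    """Return a dict of page-cache counters for fd, or None if unavailable."""
    if not sys.platform.startswith("linux"):
        return None
    if platform.machine() not in ("x86_64", "aarch64"):
        return None  # the syscall number above is only assumed for these
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    rng = CachestatRange(offset, length)
    out = Cachestat()
    ret = libc.syscall(ctypes.c_long(SYS_CACHESTAT), ctypes.c_uint(fd),
                       ctypes.byref(rng), ctypes.byref(out), ctypes.c_uint(0))
    if ret != 0:
        err = ctypes.get_errno()
        if err in (errno.ENOSYS, errno.EPERM):
            return None  # pre-6.5 kernel, or blocked by a sandbox
        raise OSError(err, os.strerror(err))
    return {name: getattr(out, name) for name, _ in Cachestat._fields_}


with tempfile.TemporaryFile() as tmp:
    tmp.write(b"x" * 8192)
    tmp.flush()
    stats = page_cache_stats(tmp.fileno())
    print(stats if stats is not None else "cachestat() is not available here")
```

The function returns None rather than raising when the call is missing, since most running kernels will predate 6.5 for some time.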

Hardware support

Miscellaneous: Renesas RZ/V2M clocked serial interfaces.
Networking: Fintek F81604 USB to 2CAN interfaces, Microchip LAN865x Rev.B0 10BASE-T1S Internal PHYs, Realtek RTL8192FU interfaces, Realtek 8723DS SDIO wireless network adapters, Realtek 8851BE PCI wireless network adapters, and MediaTek SoC Ethernet PHYs.
Regulator: TI TPS6287x power regulators, TI TPS6594 power-management chips, Rockchip RK806 power-management chips, and Renesas RAA215300 power-management ICs.

Miscellaneous

The nolibc library has gained stack protector support, a number of architecture-specific improvements, and more.

Networking

The passing of process credentials, as done with the SCM_CREDENTIALS control message, has been enhanced with a new SCM_PIDFD type. As might be expected from the name, this message passes a pidfd rather than a process ID. There is also a new SO_PEERPIDFD option to getsockopt() that obtains the pidfd of the peer process.
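A minimal sketch of the getsockopt() side, assuming a kernel with this change: it asks a Unix socket pair for the peer's pidfd and looks up the PID behind it. The numeric value 77 for SO_PEERPIDFD is an assumption taken from the generic socket header (some architectures use other values), and both helper names are hypothetical.

```python
import os
import socket

# Assumption: 77 is SO_PEERPIDFD in Linux's asm-generic socket.h; a few
# architectures define a different number.
SO_PEERPIDFD = getattr(socket, "SO_PEERPIDFD", 77)


def peer_pidfd(sock):
    """Return a pidfd for the peer of a connected Unix socket, or None."""
    try:
        return sock.getsockopt(socket.SOL_SOCKET, SO_PEERPIDFD)
    except OSError:
        # Kernels without the option reject it (typically with ENOPROTOOPT).
        return None


def pidfd_pid(pidfd):
    """Read the PID behind a pidfd from /proc, or None if /proc is unusable."""
    try:
        with open(f"/proc/self/fdinfo/{pidfd}") as info:
            for line in info:
                if line.startswith("Pid:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
pidfd = peer_pidfd(left)
if pidfd is None:
    print("SO_PEERPIDFD is not available on this kernel")
else:
    # Both ends of a socketpair() belong to this process.
    print("peer pid:", pidfd_pid(pidfd), "our pid:", os.getpid())
    os.close(pidfd)
left.close()
right.close()
```

Unlike a numeric PID obtained with SO_PEERCRED, the returned descriptor keeps referring to the same process even if its PID is later recycled.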

Security-related

The "secretmem" facility, in the form of the memfd_secret() system call, is now enabled by default. This change was made after some research determined that secretmem use does not hurt performance as had been thought.
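For readers who have not used the facility, the following sketch shows roughly what a memfd_secret() caller looks like. The syscall number (447) is an assumption that holds for x86-64 and arm64 as far as I know, the helper name secret_buffer() is hypothetical, and the function simply gives up if the kernel or a sandbox refuses the call.

```python
import ctypes
import mmap
import os
import platform
import sys

# Assumption: 447 is the memfd_secret() syscall number on x86-64 and arm64.
SYS_MEMFD_SECRET = 447


def secret_buffer(size=mmap.PAGESIZE):
    """Map `size` bytes of secret memory, or return None if unavailable."""
    if not sys.platform.startswith("linux"):
        return None
    if platform.machine() not in ("x86_64", "aarch64"):
        return None  # the syscall number above is only assumed for these
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    fd = libc.syscall(ctypes.c_long(SYS_MEMFD_SECRET), ctypes.c_uint(0))
    if fd < 0:
        return None  # e.g. ENOSYS when the facility is absent or disabled
    try:
        os.ftruncate(fd, size)
        # Secret memory must be mapped shared; the mapping outlives the fd.
        return mmap.mmap(fd, size, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    except (OSError, ValueError):
        return None
    finally:
        os.close(fd)


buf = secret_buffer()
if buf is None:
    print("memfd_secret() is not available here")
else:
    buf[:6] = b"secret"
    print(bytes(buf[:6]))
    buf.close()
```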

Internal kernel changes

The workqueue subsystem will now automatically detect CPU-intensive work items (defined as running for at least 10ms by default) and mark them. This will prevent such items from blocking the execution of other work items. There is a new debugging configuration option to enable the reporting of CPU-intensive work items detected in this way.
The kernel is now built with the -fstrict-flex-arrays=3 compiler option, adding more warnings around the use of flexible arrays. See this article for more details on this work.
The new attribute macro __counted_by() can be used to document which field in a structure contains the number of elements stored in a flexible array (in the same structure). The documentation is useful, but it can eventually be used for bounds checks as well.

The 6.5 merge window can be expected to remain open until July 9. LWN will be back shortly after that with a summary of the changes pulled in the second half; stay tuned.

Index entries for this article
Kernel	Releases/6.5

The first half of the 6.5 merge window

Posted Jun 30, 2023 15:44 UTC (Fri) by bluca (subscriber, #118303) [Link] (26 responses)

> The passing of process credentials, as done with the SCM_CREDENTIALS control message, has been enhanced with a new SCM_PIDFD type. As might be expected from the name, this message passes a pidfd rather than a process ID. There is also a new SO_PEERPIDFD option to getsockopt() that obtains the pidfd of the peer process.

To add some more details at the risk of blowing my own trumpet, this is a collab between Canonical and Microsoft to enable secure and race-free process tracking in the auth stack, more precisely between systemd, D-Bus and policykit. Together with the kernel side, this is the userspace side which will hopefully follow once the new uapi is released:

https://gitlab.freedesktop.org/dbus/dbus/-/merge_requests...
https://github.com/bus1/dbus-broker/pull/312
https://gitlab.freedesktop.org/polkit/polkit/-/merge_requ...

The first half of the 6.5 merge window

Posted Jun 30, 2023 15:50 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Bit-by-bit we're getting comprehensive pidfd-based process management.

The first half of the 6.5 merge window

Posted Jun 30, 2023 21:32 UTC (Fri) by flussence (guest, #85566) [Link] (22 responses)

Good work! Magic numbers are a scourge that needs to be fixed. I'd go as far as saying sequential fd numbers themselves ought to be next and replaced with opaque handles, but realistically I don't see POSIX systems getting around to that in my lifetime.

The first half of the 6.5 merge window

Posted Jul 1, 2023 2:08 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (18 responses)

It can just be a prctl() setting. Dense file descriptors make no sense for multithreaded programs anyway.

The first half of the 6.5 merge window

Posted Jul 1, 2023 10:22 UTC (Sat) by willy (subscriber, #9762) [Link] (9 responses)

From the kernel side, managing a dense array is far easier than managing a sparse array. It would be nice to be able to allocate a batch of fd space to each thread, so the allocation was only semi-sparse. fork() of such a beast would be a pain, but threaded tasks don't often call fork().

The first half of the 6.5 merge window

Posted Jul 1, 2023 18:19 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

It would be nice to avoid locking in multithreaded apps, especially for resources like memfd(). Something like a simple rbtree would work better than the dense array.

For optimizations, a cache of FDs for each thread might also help.

The first half of the 6.5 merge window

Posted Jul 1, 2023 18:35 UTC (Sat) by willy (subscriber, #9762) [Link] (6 responses)

The rbtree is a TERRIBLE data structure. It's as bad as a linked list in terms of cache misses. B-trees are much better. I need to do a proper article on this with measurements across a few modern CPUs and some historical ones to explain why the rbtree used to be a good choice.

The first half of the 6.5 merge window

Posted Jul 2, 2023 14:50 UTC (Sun) by Paf (subscriber, #91811) [Link]

This would be a lovely explainer to have, for a multitude of reasons

The first half of the 6.5 merge window

Posted Jul 3, 2023 7:23 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

Sure. Then use something else.

For example, each thread can get FDs in bulk in sequential blocks of 2^16. The FDs that are given to userspace can be encrypted by RC5 with 32-bit or 64-bit block size to prevent users from guessing the sequential position within the block. This way, allocation can be done locally without any locks.
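The scheme described in this comment can be modeled in user space as a toy, purely to make the idea concrete. In this sketch each thread claims a block of 2^16 sequential numbers and hands them out locally, and a small keyed Feistel network stands in for RC5 as the 32-bit permutation (an assumption made for brevity; it is not RC5 and not vetted cryptography). All names here are hypothetical, and the shared step of claiming a new block still needs synchronization.

```python
import hashlib
import itertools
import threading

BLOCK_SIZE = 1 << 16                    # sequential numbers per thread block
_KEY = b"per-process secret"            # hypothetical per-process key
_block_counter = itertools.count()      # index of the next unclaimed block
_block_lock = threading.Lock()          # taken only when claiming a block
_local = threading.local()              # per-thread allocation state


def _round(half, rnd):
    """Keyed round function mapping a 16-bit half to 16 pseudo-random bits."""
    data = half.to_bytes(2, "little") + bytes([rnd])
    digest = hashlib.blake2s(data, key=_KEY, digest_size=2).digest()
    return int.from_bytes(digest, "little")


def scramble(raw):
    """Bijectively map a 32-bit sequential number to an opaque handle."""
    left, right = raw >> 16, raw & 0xFFFF
    for rnd in range(4):
        left, right = right, left ^ _round(right, rnd)
    return (left << 16) | right


def unscramble(handle):
    """Invert scramble(), as the kernel side would to find the table slot."""
    left, right = handle >> 16, handle & 0xFFFF
    for rnd in reversed(range(4)):
        left, right = right ^ _round(left, rnd), left
    return (left << 16) | right


def alloc_handle():
    """Hand out the next handle from this thread's private block."""
    nxt = getattr(_local, "next", None)
    if nxt is None or nxt == _local.end:
        with _block_lock:               # rare: once per 65536 allocations
            block = next(_block_counter)
        nxt = block * BLOCK_SIZE
        _local.end = nxt + BLOCK_SIZE
    _local.next = nxt + 1
    return scramble(nxt)


first, second = alloc_handle(), alloc_handle()
print(hex(first), hex(second))
print("sequential underneath:", unscramble(second) - unscramble(first) == 1)
```

The handles look unrelated to the caller, while the allocator side can recover the dense position with unscramble(). Closing a handle from a different thread than the one that allocated it is not modeled here.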

The first half of the 6.5 merge window

Posted Jul 3, 2023 12:04 UTC (Mon) by kleptog (subscriber, #1183) [Link] (3 responses)

> For example, each thread can get FDs in bulk in sequential blocks of 2^16.

This will fail spectacularly on any program using select(2). I guess you could make it opt-in, but whether you could ensure that select(2) is not used in any of the libraries you depend on is another matter...

The first half of the 6.5 merge window

Posted Jul 3, 2023 13:59 UTC (Mon) by Wol (subscriber, #4433) [Link]

Well, as the manpage says, this is absurdly small for most modern applications, so I guess very few modern applications use it :-)

I think it'll be the classic "roll it out slowly, and fix the breakage".

Cheers,
Wol

The first half of the 6.5 merge window

Posted Jul 3, 2023 15:44 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

select() already fails on any program that uses more than 1024 FDs.

And yes, it has to be an opt-in via prctl() and/or via ELF flags.

The first half of the 6.5 merge window

Posted Jul 4, 2023 13:10 UTC (Tue) by Bigos (subscriber, #96807) [Link]

You can use select() with more than 1024 file descriptors. It's just fd_set is defined as a fixed-length bit array and 1024 seems to be the size in bits (128 bytes) as defined in the system headers (not sure who is the actual source, probably glibc). You can provide bigger bit arrays if you manage them yourself and then cast to fd_set*.

Whether it would be efficient is a different question.
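The trick described above can be tried out with a short experiment. This sketch, which assumes 64-bit Linux and calls the C library's select() through ctypes, duplicates a readable pipe end onto a descriptor numbered 1024 or higher and passes a hand-built bitmap larger than the usual 128 bytes. The function name and the returned dict are hypothetical, and it returns None if the descriptor limit cannot be raised far enough.

```python
import ctypes
import fcntl
import os
import resource
import sys


class Timeval(ctypes.Structure):
    # Assumption: struct timeval as laid out on 64-bit Linux.
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]


def select_beyond_fd_setsize(min_fd=1024):
    """Call select() on a descriptor >= min_fd using an oversized bitmap."""
    if not sys.platform.startswith("linux"):
        return None
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = min_fd + 64
    if soft != resource.RLIM_INFINITY and soft < wanted:
        if hard != resource.RLIM_INFINITY and hard < wanted:
            return None                  # cannot open descriptors that high
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    read_end, write_end = os.pipe()
    high = -1
    try:
        # Lowest free descriptor at or above min_fd, referring to the pipe.
        high = fcntl.fcntl(read_end, fcntl.F_DUPFD, min_fd)
        os.write(write_end, b"x")        # make the pipe readable
        bits = 8 * ctypes.sizeof(ctypes.c_ulong)
        words = (ctypes.c_ulong * (high // bits + 1))()
        words[high // bits] = 1 << (high % bits)   # what FD_SET() would do
        libc = ctypes.CDLL(None, use_errno=True)
        libc.select.argtypes = [ctypes.c_int] + [ctypes.c_void_p] * 4
        libc.select.restype = ctypes.c_int
        timeout = Timeval(0, 0)          # poll without blocking
        ready = libc.select(high + 1, words, None, None,
                            ctypes.byref(timeout))
        is_set = bool((words[high // bits] >> (high % bits)) & 1)
        return {"fd": high, "ready": ready, "is_set": is_set,
                "bitmap_bytes": ctypes.sizeof(words)}
    finally:
        for fd in (read_end, write_end, high):
            if fd >= 0:
                os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


print(select_beyond_fd_setsize())
```

The bitmap is built by hand because the FD_SET() macros (and Python's own select module) assume the fixed 1024-bit fd_set.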

The first half of the 6.5 merge window

Posted Jul 2, 2023 23:03 UTC (Sun) by njs (guest, #40338) [Link]

The nice thing about opaque handles is that the kernel is free to use whatever allocation strategy makes its job easiest.

The first half of the 6.5 merge window

Posted Jul 4, 2023 9:15 UTC (Tue) by walters (subscriber, #7396) [Link] (7 responses)

I think in basically every LWN article that references file descriptors, you pop up and say the same thing =)

Since https://lwn.net/Articles/863483/ the io_uring work has landed, and it seems to me to be the most practical way forward - more work, but more benefit.

Talking about memfd seems illustrative; today there's no io_uring op to allocate a memfd in the fixed descriptor table, but there could be. And to complete the picture here we need the io_uring spawn flow (https://lwn.net/Articles/908268/) to allow passing the memfd to a child process. And io_uring support for setsockopt() for this article.

And we'd need io_uring equivalent of the pidfd APIs.

The first half of the 6.5 merge window

Posted Jul 5, 2023 3:19 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> I think in basically every LWN article that references file descriptors, you pop up and say the same thing =)

Carthago delanda est.

io_uring is a good foot-in-the-door. So at least we now have some infrastructure for handle-based management. A prctl() option would automatically extend it to all syscalls, even the ones that are not yet supported natively in io_uring.

I like io_uring, it is becoming a better syscall interface layer. But right now, it requires a pretty heavy commitment. And it's also hard to compose across the application/library borders.

The first half of the 6.5 merge window

Posted Jul 6, 2023 7:42 UTC (Thu) by walters (subscriber, #7396) [Link] (1 responses)

> Carthago delanda est.

I had to look this up; but I learned something and it's a perfect reply, thanks =)

> And [io_uring is] also hard to compose across the application/library borders.

Yeah, this is a good point.

> A prctl() option would automatically extend it to all syscalls, even the ones that are not yet supported natively in io_uring.

Yeah, ultimately though I think something this involved would require a serious champion; but what might be slightly more of a win than LWN threads is to draft up something like a github gist or a draft PR that shows what you think of the desired interface, then you can just link to it and maybe others can contribute.

To be clear I agree with you; as the file descriptor has evolved to be "kernel API object reference" its historical semantics leave much to be desired. Related to this and your comments about encrypting the fds: I was looking at Cap'n Proto recently and its "capabilities" are pretty nice (https://capnproto.org/rpc.html#distributed-objects). It also has somewhat io_uring-like semantics in allowing chained operations on those capabilities asynchronously.

The first half of the 6.5 merge window

Posted Jul 6, 2023 23:01 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> I had to look this up; but I learned something and it's a perfect reply, thanks =)

Well, it worked for Cato eventually :)

> Yeah, ultimately though I think something this involved would require a serious champion

I thought about it, but I'm not a serious kernel developer. And this would require some major surgery in the cornerstone parts of Linux. This is not something to do lightly.

The change to discontinuous FDs is not a big deal, we'd just need to modify alloc_fd in file.c to not care about continuity. But then things get trickier and more complicated.

First, FDs won't be stored in a dense array anymore, so they'll have to be stored in a hashtable or some kind of a tree. Not a big deal, but this by itself will likely complicate the code and it won't provide any benefits by itself.

And we want the change to actually have a positive outcome in at least some benchmarks, and that's where complications start to pile up. A whole new infrastructure to allocate FDs will be needed. Ideally, threads should get their own per-thread FD pools to avoid global locking on FD allocation and deallocation. But then you get issues like allocating FDs in one thread and closing them from a different thread, so there'll be a need for some kind of a delayed close queue. In short, we're now talking about replicating what a modern memory allocator does, but for file descriptors.

I believe that it will absolutely result in faster code eventually, but not before a huge amount of work. I can sponsor some of it, but I don't have nearly enough kernel coding experience to pull it off.

The first half of the 6.5 merge window

Posted Jul 6, 2023 10:55 UTC (Thu) by anton (subscriber, #25547) [Link] (3 responses)

> Carthago delanda est.

Spelling flames are lame, but still: Ceterum censeo, Carthaginem esse delendam.

The first half of the 6.5 merge window

Posted Jul 6, 2023 22:48 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

I believe "Carthago delanda est" is grammatically correct, although it indeed is not a verbatim quote of Cato.

The first half of the 6.5 merge window

Posted Jul 7, 2023 8:36 UTC (Fri) by anselm (subscriber, #2796) [Link]

It's “delenda”, not “delanda”, but otherwise you're correct. “Carthago delenda est” just means “Carthage must be destroyed” where the longer quote means “In addition, it is my opinion that Carthage must be destroyed”. The story goes that Cato used to finish every speech in the Senate with it, even if the speech was on a completely unrelated topic.

The first half of the 6.5 merge window

Posted Aug 28, 2023 4:17 UTC (Mon) by gutschke (subscriber, #27910) [Link]

Both forms are valid ways of saying "... must be destroyed". But "carthago delenda est" sounds grating and borderline idiomatically incorrect to me; "carthaginem esse delenda" flows much better and is more eloquent.

In fact, it's often the canonical example that students learn when the grammatical structure of an ACI (accusativus cum infinitivo) is first introduced.

The first half of the 6.5 merge window

Posted Jul 1, 2023 13:10 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (1 responses)

There's too much software relying on close+dup or close+open instead of dup2 when you want a specific file descriptor number.

The first half of the 6.5 merge window

Posted Jul 1, 2023 19:54 UTC (Sat) by josh (subscriber, #17465) [Link]

You couldn't do it by default, but you could let applications opt into it.

The first half of the 6.5 merge window

Posted Jul 2, 2023 7:28 UTC (Sun) by jengelh (subscriber, #33263) [Link]

Elsewhere, people wished for 64-bit PIDs (sequential, I presume) so that they do not get re-used in reasonable time, so sequential might not be the big problem in itself. Also, fds are already opaque in that, without more calls, you do not know what they are about (e.g. file/socket). Just like pthread_t.
It will be interesting to see where we will be in 20 years.

The first half of the 6.5 merge window

Posted Jul 1, 2023 6:15 UTC (Sat) by walters (subscriber, #7396) [Link]

Nice work!

The first half of the 6.5 merge window

Posted Jul 2, 2023 7:00 UTC (Sun) by brauner (subscriber, #109349) [Link]

> To add some more details at the risk of blowing my own trumpet, this is a collab between Canonical and Microsoft to enable secure and race-free process tracking in the auth stack, more precisely between systemd, D-Bus and policykit.

Technically this may be a collab but it bothers me a little since our idea's origin has zero to do with any companies. SCM_PIDFD was conceived of years ago but I've never had time to work on it. So a year or more ago we put it onto our public (sometimes questionable) ideas list at https://github.com/uapi-group/kernel-features (see the SCM_PIDFD entry) and voila, Alex showed up eager to implement it with our help, which we're super grateful for. And Luca did the excellent userspace part of the work. Otherwise the uptake would probably take a lot longer.