The second half of the 4.17 merge window

By Jonathan Corbet
April 16, 2018

By the time the 4.17 merge window closed and 4.17-rc1 was released, 11,769 non-merge changesets had been pulled into the mainline repository. 4.17 thus looks to be a typically busy development cycle, with a merge window only slightly busier than that of 4.16. Some 6,000 of those changes were pulled after last week's summary was written. There was a lot of the usual maintenance work in those patches (over 10% of those changes were to device-tree files, for example), but also some more significant changes, including:

Core kernel

The CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks used to differ only in that the latter is fast-forwarded after a suspend-and-resume cycle. As of 4.17, CLOCK_MONOTONIC is also moved forward to reflect the time that the system spent suspended. As a result, the two clocks are now identical and have been unified within the kernel. Among other things, that change eliminates a potentially surprising behavior wherein the offset between the monotonic and realtime clocks would change after a resume. Thomas Gleixner noted: "There might be side effects in applications, which rely on the (unfortunately) well documented behaviour of the MONOTONIC clock, but the downsides of the existing behaviour are probably worse."
If applications do break, this change may have to be reverted. Meanwhile, there is a new clock (CLOCK_MONOTONIC_ACTIVE) that only advances when the system is actually running.
The new INOTIFY_IOC_SETNEXTWD ioctl() command allows inotify users to specify the number of the descriptor they would like to see returned for the next watch descriptor they create. This is used for checkpoint/restart.
After a few years of waiting, the histogram trigger feature was added to the tracing subsystem. This mechanism enables the easy creation, in kernel space, of histograms from tracing data.
The mmap() system call supports a new MAP_FIXED_NOREPLACE option. Like MAP_FIXED, it tries to place the new memory region at a user-supplied address. Unlike MAP_FIXED, though, it will not replace an existing mapping at that address; instead, it will fail with EEXIST if such a mapping exists. This is the change that was discussed last year as MAP_FIXED_SAFE; it seems that the battle over the proper name for the feature has finally been resolved.
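The difference from MAP_FIXED can be probed from user space. The following is a minimal sketch, not a definitive test: it calls the C library's mmap() through ctypes, and the numeric flag value is an assumption taken from the generic Linux UAPI headers (some architectures use a different value). The helper name try_noreplace() is made up for illustration.

```python
import ctypes
import errno
import mmap

# Assumption: value from the asm-generic UAPI headers; Python's mmap module
# may not export this flag, and a few architectures define it differently.
MAP_FIXED_NOREPLACE = 0x100000

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

# mmap() reports failure by returning (void *) -1.
MAP_FAILED = ctypes.c_void_p(-1).value


def try_noreplace():
    """Map one page, then request the same address with MAP_FIXED_NOREPLACE."""
    size = mmap.PAGESIZE
    prot = mmap.PROT_READ | mmap.PROT_WRITE
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS

    first = libc.mmap(None, size, prot, flags, -1, 0)
    if first == MAP_FAILED:
        raise OSError(ctypes.get_errno(), "initial mmap() failed")
    try:
        second = libc.mmap(first, size, prot,
                           flags | MAP_FIXED_NOREPLACE, -1, 0)
        if second == MAP_FAILED:
            err = ctypes.get_errno()
            # EEXIST is the documented refusal; report anything else by name.
            if err == errno.EEXIST:
                return "refused"
            return errno.errorcode.get(err, str(err))
        # A kernel that does not know the flag treats the address as a hint
        # and places the new mapping somewhere else.
        libc.munmap(second, size)
        return "relocated"
    finally:
        libc.munmap(first, size)
```

On a kernel with the new flag, try_noreplace() would be expected to return "refused"; on an older kernel the unknown flag is ignored and the result is "relocated", which is why callers are advised to check the returned address.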

Architecture-specific

The ARM architecture has gained support for the "system control and management interface", or SCMI. It is a set of standards for system management and, in particular, power management.
64-bit PowerPC systems now have the ability to address up to 4PB of memory.
Support for POWER4 processors was accidentally (they swear) broken in 2016, and nobody complained. So support for those processors has been removed entirely on the assumption that nobody is using them anymore.

Filesystems

The overlayfs filesystem can, at times, present different inode numbers for the same file at different times, potentially confusing applications that use those numbers. The "xino" option added for 4.17 will store the filesystem ID in the upper part of the inode number, which allows it to present inode numbers that will not change over time. Some information can be found in Documentation/filesystems/overlayfs.txt.
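The idea of folding a filesystem ID into the upper bits of an inode number can be illustrated with a little bit arithmetic. This is only a sketch under stated assumptions: the number of reserved bits (XINO_BITS) and the helper name compose_ino() are hypothetical, and the real overlayfs layout may differ in detail.

```python
# Hypothetical: how many high bits of the 64-bit inode number are reserved
# for the filesystem ID. Chosen for illustration only.
XINO_BITS = 8


def compose_ino(fsid, real_ino, xino_bits=XINO_BITS):
    """Combine a layer's filesystem ID with the underlying inode number."""
    shift = 64 - xino_bits
    if fsid >> xino_bits:
        raise ValueError("filesystem ID does not fit in the reserved bits")
    if real_ino >> shift:
        raise ValueError("underlying inode number already uses the reserved bits")
    # The underlying inode number stays in the low bits, so the result is
    # stable for as long as (fsid, real_ino) is stable.
    return (fsid << shift) | real_ino
```

With such a scheme, the same underlying inode number on two different layers yields two distinct, stable values, for example compose_ino(0, 42) and compose_ino(1, 42).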

Security-related

The kernel now supports the Speck cipher, a block cipher that is said to outperform AES on systems without hardware AES support.
AES encryption in Cipher Feedback Mode is now supported; this is required for TPM2 cryptography.
The SM4 symmetric cipher algorithm is supported; it is "an authorized cryptographic algorithm for use within China" according to the commit that added it.
The SCTP protocol now has complete SELinux support; see Documentation/security/SELinux-sctp.rst for details.
The AppArmor security module has gained basic support for the control of socket use. See this commit for a little bit of documentation.

Hardware support

Audio: Texas Instruments PCM1789 codecs, AKM AK4458 and AK5558 codecs, Rohm BD28623 codecs, Motorola CPCAP codecs, Maxim MAX9759 speaker amplifiers, ST TDA7419 audio processors, and UniPhier AIO audio subsystems.
Cryptographic: ARM TrustZone CryptoCell security processors and TI Keystone NETCP SA hardware random-number generators.
Industrial I/O: Melexis MLX90632 infrared sensors, Analog Devices AD5272 digital potentiometers, On Semiconductor LV0104CS ambient light sensors, and Microchip MCP4017/18/19 digital potentiometers.
USB: HiSilicon STB SoCs COMB PHYs, AMLogic Meson GXL and GXM USB3 PHYs, STMicroelectronics STM32 USB HS PHY controllers, HiSilicon INNO USB2 PHYs, Motorola Mapphone MDM6600 USB PHYs, Pericom PI3USB30532 Type-C cross switches, ELAN USB touchpads, and devices supporting USB class 3 audio.
Miscellaneous: QCOM on-chip GENI based serial ports, MediaTek SoC gigabit Ethernet controllers, Raspberry Pi 3 GPIO expanders, Nintendo Wii GPIO controllers, Spreadtrum SC9860 platform GPIO controllers, RAVE SP power buttons, PhoenixRC flight controller adapters, HiSilicon hi3660 mailbox controllers, Socionext SynQuacer I2C controllers, Intersil ISL12026 realtime clocks, Nuvoton NPCM750 watchdog timers, Mediatek MT2701 audsys clocks, Allwinner H6 clock controllers, Silicon Labs 544 I2C clock generators, Synopsys DesignWare AXI DMA controllers, and MediaTek High-Speed DMA controllers.

Other

The ABI for 32-bit RDMA users has changed in incompatible ways. The changes are justified with the claim that there are no actual users of the 32-bit mode now, but some may be coming in the future.

Internal kernel changes

The way that system calls are invoked on the x86-64 architecture has been reworked to make it more uniform and flexible. The new scheme has also been designed to prevent unused (but caller-controlled) data from getting onto the call stack — where it could perhaps be used in a speculative-execution attack.
The lexer and parser modules used by the kernel build process are now themselves built on the target system (requiring flex and bison) rather than being shipped in the kernel repository.

As expected, the final diffstat for this merge window shows that more lines of code were deleted than added — 191,000 more. This is only the third time in the kernel's history that a release has been smaller than its predecessor.

Also possibly worthy of note is that the final SCSI pull pushed the kernel repository to over six million objects. Linus added: "I was joking around that that's when I should switch to 5.0, because 3.0 happened at the 2M mark, and 4.0 happened at 4M objects. But probably not, even if numerology is about as good a reason as any."

This kernel now enters the stabilization process, which will culminate in the final 4.17 (or maybe 5.0?) release in early June.


The second half of the 4.17 merge window

Posted Apr 16, 2018 18:59 UTC (Mon) by josh (subscriber, #17465) [Link] (1 responses)

I'm curious what the expected downsides of the well-documented CLOCK_MONOTONIC behavior are. "the downsides of the existing behaviour are probably worse" doesn't explain why. What does this fix that balances out the potential downside of breaking existing userspace?

The second half of the 4.17 merge window

Posted Apr 17, 2018 17:47 UTC (Tue) by k8to (guest, #15413) [Link]

The obvious problem is code that assumes that the advance of CLOCK_MONOTONIC will have some relationship to the advance of time outside the computer. Even if CLOCK_MONOTONIC is well documented to indicate this is not the case, people will assume it anyway.

There may be more subtle problems, and I'd like to hear about them too. Expanding knowledge of errors in time code is kind of valuable because there are so many to make.

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 16, 2018 22:48 UTC (Mon) by glenn (subscriber, #102223) [Link] (12 responses)

I fear this change to CLOCK_MONOTONIC may induce floods of activity post-wake, as was the case with Google Chromecast not too long ago: https://www.theregister.co.uk/2018/01/18/chromecast_flood.... Timers set against CLOCK_MONOTONIC would be susceptible, no?

Also, are timers (i.e., timerfd()) against CLOCK_MONOTONIC_ACTIVE supported? If not, my code base may need a lot of rework...

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 17, 2018 5:51 UTC (Tue) by epa (subscriber, #39769) [Link] (5 responses)

If I read the article correctly, it could be summarized as “CLOCK_MONOTONIC has been renamed to CLOCK_MONOTONIC_ACTIVE, but the old name has now become an alias for something else”. It all seems a bit strange, especially given the declared intention never to break user space.

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 17, 2018 9:05 UTC (Tue) by tglx (subscriber, #31301) [Link] (4 responses)

We are well aware of the fact that it might break user space and prepared for reverting it. In hindsight we should have never introduced CLOCK_BOOTTIME, but back in the days not all architectures were converted to the generic timekeeping infrastructure.

We have discussed that back and forth and finally decided to give it a try. If you or anyone else observes wreckage please let us know immediately.

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 17, 2018 9:18 UTC (Tue) by epa (subscriber, #39769) [Link]

OK, makes sense. Maybe the unambiguous new CLOCK_MONOTONIC_ACTIVE should be added first (and backported to stable kernels) so applications that really want that can be prepared for the change. But this is just an ignorant suggestion.

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 19, 2018 12:47 UTC (Thu) by lynxeye (subscriber, #90890) [Link]

I haven't observed it yet, but I can definitely see a place where things will break: most DRM drivers specify IOCTL timeouts as absolute timeouts in terms of CLOCK_MONOTONIC. So if GPU operations get suspended and are only submitted to the hardware after resume, userspace will see a lot of its waits time out while the GPU is still happily working through its queue of work.

This is unexpected and I bet most of the graphics userspace will fall over if it hits such a condition.

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 27, 2018 3:38 UTC (Fri) by njs (guest, #40338) [Link] (1 responses)

Traditionally nanosleep() and the timeouts in select(), epoll_wait(), etc., have all used the CLOCK_MONOTONIC clock, so that if you sleep for 10 seconds, and after 5 seconds the system is suspended for an hour, then after it wakes up again the process keeps sleeping for another 5 seconds.

Did you keep the relationship between sleeping syscalls and CLOCK_MONOTONIC – so that e.g. a nanosleep() before suspend will now wake up immediately on resume? Or did you keep the old sleeping syscall semantics, and break the relationship with CLOCK_MONOTONIC?

As far as I know, all correct event loops currently depend on the assumption that sleeping syscalls and CLOCK_MONOTONIC match each other. For example, if I set a timeout for T seconds from now, the event loop will:

- use (clock_gettime(CLOCK_MONOTONIC) + T) to calculate the absolute time of the timeout
- later, when it calls epoll_wait(), it'll choose the timeout by doing (deadline - clock_gettime(CLOCK_MONOTONIC))
- then it passes that timeout to epoll_wait()

Right now that's sufficient to ensure that epoll_wait() will return when clock_gettime(CLOCK_MONOTONIC) == deadline, or thereabouts... but if CLOCK_MONOTONIC starts counting suspend time, while epoll_wait() doesn't, then we'll start sleeping too long and missing our deadlines by an arbitrary amount.

Or at least, that's what the event loop I maintain does, which is why I want to know :-).
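The deadline arithmetic in the three steps above can be sketched as follows. This is a minimal illustration, not the code of any particular event loop: the helper names are made up, and the clock is passed in as a parameter so that the effect of a clock that jumps across a suspend can be simulated. On Linux, Python's time.monotonic() reads CLOCK_MONOTONIC.

```python
import time


def deadline_after(seconds, clock=time.monotonic):
    """Step 1: turn a relative timeout into an absolute deadline."""
    return clock() + seconds


def remaining_timeout(deadline, clock=time.monotonic):
    """Steps 2 and 3: the value an event loop would hand to epoll_wait()."""
    return max(0.0, deadline - clock())


class FakeClock:
    """Stand-in clock for experiments; advance() simulates time passing."""

    def __init__(self, start=0.0):
        self.now = start

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds
```

For example, with clock = FakeClock(100.0) and deadline = deadline_after(10, clock), advancing the clock by 5 leaves remaining_timeout(deadline, clock) at 5.0. Advancing it by a further 3600, as a suspend-aware clock would, drops the remaining timeout to 0.0. The open question in the comment is whether a sleep that is already in progress sees that same jump.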

(As an added bonus, if I *do* have to switch to CLOCK_MONOTONIC_ACTIVE, that's going to be a hassle. Currently the event loop is implemented in Python, and the Python standard library obviously doesn't yet have any bindings for CLOCK_MONOTONIC_ACTIVE. Given where we are in the release cycle, the earliest they could be added is 1.5-2 years from now. In the meantime I guess it becomes temporarily impossible to implement an event loop in Python on Linux; you have to write part of it in C, and that's a huge obstacle for distribution :-(.)

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 27, 2018 3:52 UTC (Fri) by njs (guest, #40338) [Link]

> In the meantime I guess it becomes temporarily impossible to implement an event loop in Python on Linux; you have to write part of it in C, and that's a huge obstacle for distribution :-(.

On further investigation, it looks like it's not quite as bad as I thought – CLOCK_MONOTONIC_ACTIVE can be queried from Python with:

time.clock_gettime(12)

(Untested, since I don't have a kernel with CLOCK_MONOTONIC_ACTIVE support).
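A slightly more defensive variant, as a sketch only: it assumes the clock number 12 quoted above and falls back to CLOCK_MONOTONIC when the running kernel rejects that ID. The constant and function names are made up for illustration and are not part of the time module.

```python
import time

# Assumption: the clock ID quoted above; not exported by Python's time module.
CLOCK_MONOTONIC_ACTIVE = 12


def active_monotonic():
    """Read the 'active' monotonic clock, or CLOCK_MONOTONIC if it is unavailable."""
    try:
        return time.clock_gettime(CLOCK_MONOTONIC_ACTIVE)
    except OSError:
        # The kernel does not recognize this clock ID.
        return time.clock_gettime(time.CLOCK_MONOTONIC)
```

The fallback means the same code keeps working on kernels without the new clock, at the cost of silently getting whatever semantics CLOCK_MONOTONIC has there.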

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 17, 2018 17:52 UTC (Tue) by k8to (guest, #15413) [Link] (5 responses)

I'm trying to figure this out.

Are we worried that the time jumping forward may expire many timers at once causing programs to do work? That seems correct. It's fairly easy for programs with many expired timers to amortize the cost of doing the work those timers represent, and they probably need to have that logic in place anyway if they hope to self-regulate.

If you're instead worried about many different programs having expiring timers and fighting over resources, that seems like a problem that requires a co-ordinating facility. Grand Central Dispatch from Apple would be one approach. Of course, in a way, the operating system's basic task switching functions are another.

The other option would be some software that thinks it needs to do some work for every interval window, so that if 1000 intervals are passed, it insists on doing 1000 times the work. That behavior is either required (if for example, there's a requirement to look at each time interval's data sample), or is fundamentally broken. I'm not sure how this particular change really affects either of those two situations.

Am I missing something?

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 17, 2018 19:59 UTC (Tue) by glenn (subscriber, #102223) [Link] (4 responses)

> Are we worried that the time jumping forward may expire many timers at once causing programs to do work?

This is my concern. I've used CLOCK_MONOTONIC timers to trigger periodic tasks, such as transmit a heartbeat/health-status message, run a watchdog check, etc. Another use-case could be a timer that drives a game loop or animation. The logic surrounding these routines is simple because the (old) CLOCK_MONOTONIC is simple. The software built up around such timers might hide the underlying timer mechanisms (e.g., a timerfd file descriptor), so higher-level application-level software might be unable to reprogram the underlying timer (or cancel it).

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 17, 2018 20:07 UTC (Tue) by k8to (guest, #15413) [Link] (3 responses)

But for these scenarios, it's no big deal. Your timers will expire, and you'll send a heartbeat or watchdog check after being asleep for an hour. Maybe your games draw some frames a tiny bit earlier than they need to. It should all settle down rather quickly. For most cases you would want your timers to expire after being asleep an hour.

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 17, 2018 20:53 UTC (Tue) by glenn (subscriber, #102223) [Link] (2 responses)

> But for these scenarios, it's no big deal. Your timers will expire, and you'll send a heartbeat or watchdog check after being asleep for an hour. Maybe your games draw some frames a tiny bit earlier than they need to. It should all settle down rather quickly. For most cases you would want your timers to expire after being asleep an hour.

For one-shot timers, I believe that you are correct. My concern is with periodic timers.

Consider the use case of timerfd with a 10Hz periodic timer on CLOCK_MONOTONIC. Your application logic invokes a callback for every increment of the timerfd counter. Before you suspend, the timerfd count is 0---you have no callbacks to execute. You wake from suspension after an hour. The timerfd counter has been fast-forwarded and has a backlogged count of 36,000. If your application logic is simple, you'll invoke your callback in a burst of 36k invocations as you burn the counter back down to zero.

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 18, 2018 5:47 UTC (Wed) by k8to (guest, #15413) [Link] (1 responses)

Is this really a problem?

> read(2)
> If the timer has already expired one or more times since its
> settings were last modified using timerfd_settime(), or since
> the last successful read(2), then the buffer given to read(2)
> returns an unsigned 8-byte integer (uint64_t) containing the
> number of expirations that have occurred.

If you get a read() of 36,000 and you execute your logic 36,000 times your program is just busted. Runaway could occur without this quirk.
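One way to avoid replaying every missed tick is to cap the catch-up work. The sketch below assumes only what the quoted man page says, namely that read() on a timerfd yields an unsigned 8-byte integer in host byte order. The function names and the max_catch_up parameter are hypothetical.

```python
import struct


def expirations_from_read(buf):
    """Decode the 8-byte expiration counter returned by read() on a timerfd."""
    (count,) = struct.unpack("=Q", buf)
    return count


def callbacks_to_run(expirations, max_catch_up=1):
    """Return (callbacks to invoke now, expirations deliberately skipped)."""
    run = min(expirations, max_catch_up)
    return run, expirations - run
```

With a backlog of 36,000 and max_catch_up=3, callbacks_to_run() returns (3, 35997): three callbacks run and the rest are dropped. Choosing max_catch_up is the application's judgement call.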

Possible side-effects of CLOCK_MONOTONIC change?

Posted Apr 18, 2018 17:33 UTC (Wed) by glenn (subscriber, #102223) [Link]

> If you get a read() of 36,000 and you execute your logic 36,000 times your program is just busted. Runaway could occur without this quirk.

That is a fair point. However, this kind of defensive programming was unnecessary under the old CLOCK_MONOTONIC contract. Moreover, if code needs to be updated to detect unexpected timer backlogs, the developer has to make a judgement call on how many backlogged timers are too many: It may not always be clear if a backlog is due to system suspension or if an application is simply unable to service its timers fast enough (either due to its own execution behaviors, or due to those of other processes inducing CPU starvation). Setting a timer against CLOCK_MONOTONIC_ACTIVE may be an easier countermeasure. In either case, userspace has to change.

The second half of the 4.17 merge window

Posted Apr 19, 2018 12:43 UTC (Thu) by clugstj (subscriber, #4020) [Link]

The whole point of CLOCK_MONOTONIC (I thought) was that it didn't change value arbitrarily. It can now jump forward by an arbitrary amount at any point in time (as far as userspace knows). Not exactly a helpful change in my opinion.

The second half of the 4.17 merge window

Posted Apr 26, 2018 9:03 UTC (Thu) by sourcejedi (guest, #45153) [Link]

KVM/qemu is one area (both inside and outside the kernel) that might want to use CLOCK_MONOTONIC_ACTIVE.

https://bugzilla.redhat.com/show_bug.cgi?id=1524412

The second half of the 4.17 merge window

Posted Apr 26, 2018 9:39 UTC (Thu) by tkhai (guest, #99286) [Link]

>INOTIFY_IOC_SETNEXTWD ioctl() command... This is used for checkpoint/restart.

This is not for checkpoint/restart, this is for checkpoint/restore :D