The rest of the 5.0 merge window
The most significant changes merged in the last week include:
Architecture-specific
- The C-SKY architecture has gained support for CPU hotplugging, ftrace, and perf.
Core kernel
- There is a new "dynamic events" interface to the tracing subsystem. It unifies the three distinct interfaces (for kprobes, uprobes, and synthetic events) into a single control file. See this patch posting for a brief overview of how this interface works.
Hardware support
- Miscellaneous: NVIDIA Tegra20 external memory controllers, Qualcomm PM8916 watchdog timers, TQ-Systems TQMX86 watchdog timers, MediaTek Command-Queue DMA controllers, UniPhier MIO DMA controllers, Raspberry Pi touchscreens, Amlogic Meson PCIe host controllers, and Socionext UniPhier PCIe controllers.
- Pin control: NXP IMX8QXP pin controllers, MediaTek MT6797 and MT7629 pin controllers, Actions Semi S700 pin controllers, and Renesas RZ/A2 GPIO and pin controllers.
- Support for high-resolution mouse scroll wheels has been significantly improved.
Security
- A small piece of the secure-boot lockdown patch set has landed in the form of additional control over the kexec_load_file() system call. There is a new keyring (called .platform) for keys provided by the platform; it cannot be updated by a running system. Keys in this ring can be used to control which images may be run via kexec_load_file(). It has also become possible for security modules to prevent calls to kexec_load(), which cannot be verified in the same manner.
- The secure computing (seccomp) mechanism can now defer policy decisions to user space. See this new documentation for details on the final version of the API.
- The fscrypt filesystem encryption subsystem has gained support for the Adiantum encryption mode (which was added earlier in the merge window).
- The semantics of the mincore() system call have changed; see below for details.
Internal kernel
- The venerable access_ok() function, which verifies that
an address lies within the user-space region, has lost its first
argument. This argument was either VERIFY_READ or
VERIFY_WRITE depending on the type of access, but no
implementation of access_ok() actually used that
information. The new prototype is:
int access_ok(void *address, int len);
The patch implementing this change ended up modifying over 600 files. There have also been several follow-up patches fixing various issues created by this change.
Changing mincore()
The mincore() system call is used to determine which pages in a
virtual address-space range are currently resident in the page cache; the
idea is to allow an application to learn which of its pages can be accessed
without incurring page faults. As Torvalds notes in this
commit, the intended semantics of this call have always been
"somewhat unclear", but its behavior all along has been to
indicate which pages are resident in the cache, regardless of whether the
calling process has ever tried to access those pages. In other words,
mincore() would reveal the presence of pages faulted in by other
processes running in the system.
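For readers who have not used the call, here is a minimal sketch of querying residency from user space. The helper names count_resident() and touch_and_count() are hypothetical, invented for this illustration; only mincore(), mmap(), munmap(), and sysconf() are real interfaces. The sketch uses an anonymous mapping, where only pages the caller has touched should be reported as resident; file-backed mappings are where the behavior described above comes into play.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes mincore() and MAP_ANONYMOUS in glibc */
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count the pages in [addr, addr + len) that mincore() reports as resident.
 * addr must be page-aligned. Returns -1 on error.
 * (Hypothetical helper, not part of any library.)
 */
static long count_resident(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(npages);
    long resident = 0;

    if (vec == NULL)
        return -1;
    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;  /* only the low bit of each byte is defined */
    free(vec);
    return resident;
}

/*
 * Map `total` anonymous pages, write to the first `touched` of them, and
 * return how many pages mincore() then reports as resident (-1 on error).
 */
static long touch_and_count(size_t total, size_t touched)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = total * (size_t)page;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED)
        return -1;
    for (size_t i = 0; i < touched && i < total; i++)
        buf[i * (size_t)page] = 1;   /* fault the page in */

    long n = count_resident(buf, len);
    munmap(buf, len);
    return n;
}
```

Calling touch_and_count(8, 3) from a small main() should report at least three resident pages, since those three were just written to.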
Naturally, it turns out that if you can observe aspects of the system state that result from other processes' activity, you can use those observations to extract information that should be hidden. Daniel Gruss et al. have recently released a paper [PDF] showing how mincore() can be exploited in just this manner. In response, Jiri Kosina posted a patch allowing system administrators to turn mincore() into a privileged system call by way of a sysctl knob, but Torvalds wasn't pleased with that approach. He responded with a patch restricting the information returned by mincore() to anonymous pages and a small subset of file pages.
After Jann Horn pointed out that restricting the query to the calling process's page tables reduces the attack surface considerably, though, Torvalds decided to change his approach. As a result, the patch that was committed adds no new knobs, but does unconditionally restrict mincore() to pages that are actually mapped by the calling process (that is, pages that said process has accessed at some point). That makes it much harder to use mincore() to observe what other processes are doing, though, as Torvalds pointed out, such observation remains theoretically possible.
So the easy attack is closed, but that additional security may come at the cost of creating problems for user space. As Torvalds noted in the changelog:
I'm hoping that nobody actually has any workflow that cares, and the info leak is real.
If the change breaks code in the wild, it may have to be reverted and some other solution found; for this reason, this patch has not been marked for inclusion into the stable kernels. For those out there who have code that uses mincore(), now would be a good time to test the new semantics to ensure that things still work as expected.
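One way to check how a given kernel behaves is to compare what mincore() reports for a file-backed mapping before and after the caller touches it. The following is a rough sketch under stated assumptions: the names probe_result, resident_pages(), and probe_file_mapping(), the PROBE_PAGES constant, and the temporary-file path are all hypothetical, and the code assumes a writable /tmp. Going by the description above, the old behavior would be expected to report the freshly written pages as resident even before they are touched, while the committed change would be expected to report only pages the caller has faulted in; the sketch simply records both counts rather than asserting either outcome.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes mincore() in glibc */
#endif
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define PROBE_PAGES 4            /* hypothetical size of the probe file, in pages */

/* Residency counts seen before and after the caller touches the mapping. */
struct probe_result {
    int before;                  /* pages reported resident right after mmap() */
    int after;                   /* pages reported resident after reading each page */
};

/* Count resident pages in a PROBE_PAGES-long mapping; -1 on error. */
static int resident_pages(void *addr, size_t len)
{
    unsigned char vec[PROBE_PAGES];
    int n = 0;

    if (mincore(addr, len, vec) != 0)
        return -1;
    for (int i = 0; i < PROBE_PAGES; i++)
        n += vec[i] & 1;
    return n;
}

/* Returns 0 on success and fills *out; returns -1 on any failure. */
static int probe_file_mapping(struct probe_result *out)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = (size_t)page * PROBE_PAGES;
    char path[] = "/tmp/mincore-probe-XXXXXX";   /* assumed writable location */
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                /* the file disappears once it is closed */

    /* Write real data so that the pages exist in the page cache. */
    char *data = calloc(1, len);
    if (data == NULL || write(fd, data, len) != (ssize_t)len) {
        free(data);
        close(fd);
        return -1;
    }
    free(data);

    char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    out->before = resident_pages(map, len);

    volatile char sink = 0;
    for (int i = 0; i < PROBE_PAGES; i++)
        sink += map[(size_t)i * (size_t)page];   /* fault each page in */
    (void)sink;

    out->after = resident_pages(map, len);
    munmap(map, len);
    return (out->before < 0 || out->after < 0) ? -1 : 0;
}
```

An application that relies on the "before" count to decide whether to prefetch data is the sort of code that would be worth retesting.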
A couple of significant things were not merged before the merge
window closed, including the controversial
fs-verity patch set. Also missing again is the new filesystem mounting API, though some
of the precursor patches did go in toward the end of the merge window.
Unless something surprising happens, the feature set for this cycle is
complete and the 5.0 kernel is now in the stabilization phase, with a final
release expected in late February.
Index entries for this article | |
---|---|
Kernel | Releases/5.0 |
Posted Jan 11, 2019 21:44 UTC (Fri)
by HIGHGuY (subscriber, #62277)
If so, madv_willneed/dontneed, readahead and friends, can all help leak this information.
Posted Jan 24, 2019 9:03 UTC (Thu)
by polyp (guest, #53146)
If one can time how long it takes to access a page, I’m sure you can differentiate between in memory, soft page fault and hard page fault.