Linux 4.0 was released on Sun, 12 Apr 2015.
Summary: This release adds support for live patching the kernel code, aimed primarily at applying security updates without rebooting; DAX, a way to avoid using the kernel cache when filesystems run on systems with persistent memory storage; kasan, a dynamic memory error detector that finds use-after-free and out-of-bounds bugs; lazytime, an alternative to relatime, which causes access, modified and changed time updates to only be made in the cache and written to the disk opportunistically; support for multiple lower layers in overlayfs; support for the Parallel NFS server architecture; and dm-crypt CPU scalability improvements. There are also new drivers and many other small improvements.
1. Prominent features
1.1. Arbitrary version change
This release increases the version to 4.0. This switch from 3.x to 4.0 version numbers is, however, entirely meaningless and should not be associated with any important changes in the kernel. This release could have been 3.20, but Linus Torvalds just got tired of the old number, ran a poll, and changed it. Yes, it is frivolous. The less you think about it, the better.
1.2. Live patching
This release introduces "livepatch", a feature for live patching the kernel code, aimed primarily at systems that want to get security updates without needing to reboot. This feature is the result of merging kgraft and kpatch, two efforts by SUSE and Red Hat that were started to replace the now proprietary ksplice. It is relatively simple and minimalistic, as it makes use of existing kernel infrastructure (namely ftrace) as much as possible. It is also self-contained and does not hook itself into any other kernel subsystems.
In this release livepatch is not feature complete, yet it provides a basic infrastructure for function "live patching" (i.e. code redirection), including an API for kernel modules containing the actual patches, and an API/ABI for userspace to operate on the patches (look up what patches are applied, enable/disable them, etc). Most CVEs should be safe to apply this way. Only the x86 architecture is supported in this release; others will follow.
For more details see the merge commit
Sample live patching module: commit
Code commit
1.3. DAX - Direct Access, for persistent memory storage
Before being read by programs, files are usually first copied from the disk to the kernel caches, kept in RAM. But the possible advent of persistent non-volatile memory that would also be used as disk radically changes the way the kernel deals with this process: the kernel cache would become unnecessary overhead.
Linux has had, in fact, support for this kind of setup since 2.6.13. But the code wasn't maintained and only supported ext2. In this release, Linux adds DAX (Direct Access, the X is for eXciting). DAX removes the extra copy incurred by the kernel cache by performing reads and writes directly to the persistent-memory storage device. For file mappings, the storage device is mapped directly into userspace. Support for ext4 has been added.
Recommended LWN article: Supporting filesystems in persistent memory
Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit, commit
1.4. kasan, kernel address sanitizer
Kernel Address Sanitizer (KASan) is a dynamic memory error detector. It provides a fast and comprehensive solution for finding use-after-free and out-of-bounds bugs. Linux already has the kmemcheck feature, but unlike kmemcheck, KASan uses compile-time instrumentation, which makes it significantly faster than kmemcheck.
The main idea of KASan is to use shadow memory to record whether each byte of memory is safe to access or not, and to use compiler instrumentation to check the shadow memory on each memory access. KASan uses 1/8 of the memory addressable in the kernel for shadow memory, and uses direct mapping with a scale and offset to translate a memory address to its corresponding shadow address.
Code: commit, commit, commit, commit, commit
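The scale-and-offset translation described for KASan can be illustrated with a small user-space model. This is only a sketch: the offset value, the class name and the per-granule encoding (a count of accessible bytes) are assumptions made for illustration, not the kernel's actual constants or shadow encoding.

```python
SHADOW_SCALE_SHIFT = 3            # 8 bytes of memory per shadow byte (the 1/8 ratio)
SHADOW_OFFSET = 0x10000           # hypothetical offset, for illustration only
GRANULE = 1 << SHADOW_SCALE_SHIFT


def mem_to_shadow(addr):
    """Translate a memory address to the address of its shadow byte."""
    return (addr >> SHADOW_SCALE_SHIFT) + SHADOW_OFFSET


class ToyShadowMemory:
    """Toy shadow map: shadow address -> number of accessible bytes in the granule."""

    def __init__(self):
        self.shadow = {}

    def unpoison(self, addr, size):
        """Mark [addr, addr + size) as accessible; addr must be granule-aligned."""
        assert addr % GRANULE == 0
        for off in range(0, size, GRANULE):
            self.shadow[mem_to_shadow(addr + off)] = min(GRANULE, size - off)

    def poison(self, addr, size):
        """Mark every granule covering [addr, addr + size) as inaccessible."""
        assert addr % GRANULE == 0
        for off in range(0, size, GRANULE):
            self.shadow[mem_to_shadow(addr + off)] = 0

    def is_valid_access(self, addr, size=1):
        """Check each accessed byte against its shadow byte."""
        for a in range(addr, addr + size):
            accessible = self.shadow.get(mem_to_shadow(a), 0)
            if (a % GRANULE) >= accessible:
                return False
        return True
```

In this model, an "allocation" of 13 bytes is unpoisoned, so an access one byte past its end is reported as invalid (out-of-bounds), and any access after the region is poisoned again is reported as invalid (use-after-free).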
1.5. "lazytime" option for better update of file timestamps
Unix filesystems keep track of information about files, such as the last time a file was accessed or modified. Keeping track of this information is very expensive, especially the time when a file was accessed ("atime"), which encourages many people to disable it with the mount option "noatime". To alleviate this problem, the "relatime" mount option was added: with it, the atime is only updated if the previous value is earlier than the modification time, or if the file was last accessed more than 24 hours ago. This behaviour, however, breaks some programs that rely on accurate access time tracking to work, and it is also against the POSIX standard.
In this release, Linux adds another alternative: "lazytime". Lazytime causes access, modified and changed time updates to only be made in the cache. The times will only be written to the disk if the inode needs to be updated anyway for some non-time related change, if fsync(), syncfs() or sync() are called, or just before an undeleted inode is evicted from memory. This is POSIX compliant, while at the same time improving performance.
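The relatime rule as stated above can be written as a small predicate. This is a sketch of the rule exactly as described in this text, with timestamps as plain seconds; the function name is hypothetical and the kernel's real check may differ in details.

```python
DAY_SECONDS = 24 * 60 * 60


def relatime_should_update(atime, mtime, now):
    """Return True if, under the rule described above, atime would be updated.

    atime, mtime and now are timestamps in seconds.
    """
    older_than_mtime = atime < mtime               # previous atime predates last modification
    older_than_a_day = (now - atime) > DAY_SECONDS  # last access was more than 24 hours ago
    return older_than_mtime or older_than_a_day
```

For instance, a file read twice within a minute without being modified in between would get an atime update on neither read under this rule, which is exactly the inaccuracy that some programs cannot tolerate.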
Recommended LWN article: Introducing lazytime
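The lazytime write-back conditions described in this section can be mimicked with a toy in-memory model. It is only a sketch: the class and method names are hypothetical, "disk" is just a dictionary, and real filesystems track dirtiness quite differently.

```python
class ToyLazytimeInode:
    """Toy model: timestamps live in memory and reach 'disk' only on certain events."""

    def __init__(self):
        self.cached = {"atime": 0, "mtime": 0, "ctime": 0}
        self.on_disk = dict(self.cached)
        self.disk_writes = 0
        self.deleted = False

    def _write_inode(self):
        # Writing the inode carries the cached timestamps along with it.
        self.on_disk = dict(self.cached)
        self.disk_writes += 1

    def update_time(self, which, value):
        # Timestamp-only update: touch the in-memory copy, no disk I/O.
        self.cached[which] = value

    def change_non_time_field(self):
        # A non-time change (e.g. size) forces an inode write anyway.
        self._write_inode()

    def sync(self):
        # Stands in for fsync(), syncfs() or sync().
        if self.cached != self.on_disk:
            self._write_inode()

    def evict(self):
        # Flush pending timestamps before an undeleted inode leaves memory.
        if not self.deleted and self.cached != self.on_disk:
            self._write_inode()
```

In this model, a hundred consecutive atime updates cause no writes at all; a single sync() afterwards produces one write that carries the latest timestamp.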
1.6. Multiple lower layers in overlayfs
In overlayfs, multiple lower layers can now be given using the colon (":") as a separator character between the directory names. For example:
- mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged
The specified lower directories will be stacked beginning from the rightmost one and going left. In the above example lower1 will be the top, lower2 the middle and lower3 the bottom layer. "upperdir=" and "workdir=" may be omitted; in that case the overlay will be read-only.
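The stacking order can be made concrete with a small model of how a name is resolved across the lower layers. This is a sketch under simplifying assumptions: the helper names are hypothetical, the option is split naively on ":", and overlayfs features such as whiteouts and directory merging are ignored.

```python
def parse_lowerdir(option):
    """Split a 'lowerdir=...' option into layers ordered from top to bottom."""
    key, sep, value = option.partition("=")
    if key != "lowerdir" or not sep or not value:
        raise ValueError("expected 'lowerdir=dir1:dir2:...'")
    # The leftmost directory is the top layer, the rightmost the bottom one.
    return value.split(":")


def find_layer(name, layers, contents):
    """Return the topmost layer that provides 'name', or None.

    'contents' maps each layer path to the set of names it contains.
    """
    for layer in layers:
        if name in contents.get(layer, set()):
            return layer
    return None
```

With the example mount above, a file present in both /lower1 and /lower2 would be served from /lower1, since it is the top layer.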
1.7. Support Parallel NFS server, default to NFS v4.2
Parallel NFS (pNFS) is a part of the NFS v4.1 standard that allows compute clients to access storage devices directly and in parallel. The pNFS architecture eliminates the scalability and performance issues associated with NFS servers deployed today. This is achieved by the separation of data and metadata, and moving the metadata server out of the data path.
This release adds support for a pNFS server, along with drivers for the block layout (with XFS support, so that XFS filesystems can be used as a block layout target) and for the flexfiles layout.
Also, in this release the NFS server defaults to NFS v4.2.
Code: commit, commit, commit, commit, commit, commit
1.8. dm-crypt scalability improvements
This release significantly increases dm-crypt CPU scalability thanks to changes that enable effective use of an unbound workqueue across all available CPUs. A large battery of tests was performed to validate these changes; a summary of the results is available here
Merge: commit
2. Drivers and architectures
All the driver and architecture-specific changes can be found in the Linux_4.0-DriversArch page
3. File systems
- XFS
- EXT4
Support "readonly" filesystem flag to mark a FS image as read-only, tunable with tune2fs. It prevents the kernel and e2fsprogs from changing the image commit
- Btrfs
Add code to support file creation time commit
- NFSv4.1
- UBIFS
- OCFS2
Add a journal_async_commit mount option to the ocfs2 filesystem. When this feature is enabled, the journal commit block can be written to disk without waiting for descriptor blocks, which can improve journal commit performance. In the fs_mark benchmark, journal_async_commit shows a 50% improvement commit
Previously, in the case of an appending O_DIRECT write (block not allocated yet), ocfs2 fell back to buffered I/O, which has some disadvantages. In this version, direct I/O writes no longer fall back to buffered I/O, because block allocation is now supported in the direct I/O path commit, commit, commit
- F2FS
Introduce a batched trim commit
Support the "norecovery" mount option, which is mostly the same as "disable_roll_forward". The only difference is that "norecovery" should be activated together with the read-only mount option. This can be used when the user wants to check whether f2fs is mountable or not without any recovery process commit
Add the F2FS_IOC_GETVERSION ioctl for getting i_generation from an inode; with it, users can list a file's generation number by using "lsattr -v" commit
4. Block
- Ported to blk-multiqueue
blk-multiqueue: Add support for tag allocation policies and make libata use this blk-mq tagging, instead of rolling their own commit, commit
UBI: Implement UBI_METAONLY, a new open mode for UBI volumes which indicates that only metadata is being changed commit
5. Core (various)
pstore: Add pmsg - user-space accessible pstore object commit
rcu: Optionally run grace-period kthreads at real-time priority. Recent testing has shown that under heavy load, running RCU's grace-period kthreads at real-time priority can improve performance and reduce the incidence of RCU CPU stall warnings commit
GDB scripts for debugging the kernel. If you load vmlinux into gdb with the option enabled, the helper scripts will be automatically imported by gdb as well, and additional functions are available to analyze a Linux kernel instance. See Documentation/gdb-kernel-debugging.txt for further details commit
Remove CONFIG_INIT_FALLBACK commit
6. Memory management
cgroups: Per memory cgroup slab shrinkers commit
slub: optimize memory alloc/free fastpath by removing preemption on/off commit
Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can detect zero_page in /proc/kpageflags, and then do memory analysis more accurately commit
Make /dev/mem an optional device commit
Add support for resetting peak RSS, which can be retrieved from the VmHWM field in /proc/pid/status, by writing "5" to /proc/pid/clear_refs commit
Show the page size in /proc/<pid>/numa_maps as the "kernelpagesize_kB" field, to help identify the size of the pages that back memory areas mapped by a given task. This is especially useful for differentiating between HUGE and GIGANTIC page backed VMAs commit
geneve: Add Geneve GRO support commit
zsmalloc: add statistics support commit
Incorporate read-only pages into transparent huge pages commit
memcontrol cgroup: Introduce the basic control files to account, partition, and limit memory using cgroups in default hierarchy mode. The old interface will be maintained, but a clearer model and improved workload performance should encourage existing users to switch over to the new one eventually commit
Replace remap_file_pages() syscall with emulation commit
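The peak RSS reset listed above (writing "5" to /proc/pid/clear_refs and reading VmHWM from /proc/pid/status) could be used from userspace roughly as follows. This is a sketch: the helper names are hypothetical, and the two functions that touch /proc only work on a Linux kernel that includes this change and for a process the caller is allowed to modify.

```python
def parse_vm_hwm_kb(status_text):
    """Extract the VmHWM value (peak RSS, in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmHWM:"):
            return int(line.split()[1])
    return None


def read_peak_rss_kb(pid="self"):
    """Read the current peak RSS of a process, in kB."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_hwm_kb(f.read())


def reset_peak_rss(pid="self"):
    """Reset the peak RSS counter by writing "5" to clear_refs."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("5")
```

A typical use would be to call reset_peak_rss() before a phase of a program and read_peak_rss_kb() after it, to measure the peak memory of that phase alone.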
7. Virtualization
KVM: Add generic support for page modification logging, a new feature in Intel "Broadwell" Xeon CPUs that speeds up dirty page tracking commit
vfio: Add device request interface indicating that the device should be released commit
vmxnet3: Make Rx ring 2 size configurable by adjusting rx-jumbo parameter of ethtool -G commit
virtio_net: add software timestamp support commit
virtio_pci: modern driver commit, add an option to disable the legacy driver commit, commit
8. Cryptography
aesni: Add support for 192 & 256 bit keys to AES-NI RFC4106 commit
algif_rng: add random number generator support commit
octeon: add MD5 module commit
qat: add support for CBC(AES) ablkcipher commit
9. Security
SELinux: Add security hooks to the Android Binder that enable security modules such as SELinux to implement controls over Binder IPC. The security hooks include support for controlling which process can become the Binder context manager, invoke a binder transaction/IPC to another process, transfer a binder reference to another process, and transfer an open file to another process. These hooks have been included in the Android kernel trees since Android 4.3 (commit).
SMACK: secmark support for netfilter (commit).
Device class for TPM, sysfs files are moved from /sys/class/misc/tpmX/device/ to /sys/class/tpm/tpmX/device/ (commit).
10. Tracing & perf
perf mem: Enable sampling loads and stores simultaneously, it could only do one or the other before yet there was no hardware restriction preventing simultaneous collection commit
perf tools: Support parameterized and symbolic events. See links for documentation commit, commit
AMD range breakpoints support: breakpoints are extended to support address ranges through perf events, with initial backend support for AMD extended breakpoints. For example, to set a write breakpoint from 0x1000 to 0x1200 (0x1000 + 512): perf record -e mem:0x1000/512:w commit, commit
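The address arithmetic in the range breakpoint example can be checked with a short helper that builds the event string from a start address and a length. This is an illustrative sketch; the function name is hypothetical and it only formats the string, it does not run perf.

```python
def range_breakpoint_event(start, length, access):
    """Build a perf 'mem:' event string for a breakpoint covering [start, start + length)."""
    if length <= 0:
        raise ValueError("length must be positive")
    if not access or set(access) - set("rwx"):
        raise ValueError("access must be a combination of r, w and x")
    return f"mem:{start:#x}/{length}:{access}"


def range_end(start, length):
    """Return the first address past the watched range."""
    return start + length
```

For the example above, range_breakpoint_event(0x1000, 512, "w") yields the same event string passed to perf record, and range_end(0x1000, 512) equals 0x1200.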
11. Networking
TCP: Add the possibility to define a per route/destination congestion control algorithm. This opens up the possibility for a machine with different links to enforce specific congestion control algorithms with optimal strategies for each of them based on their network characteristics commit
Mitigate TCP "ACK loop" DoS scenarios by rate-limiting outgoing duplicate ACKs sent in response to incoming "out of window" segments. For more details, see merge. Code: commit, commit, commit, commit
udpv6: Add lockless sendmsg() support, thus allowing multiple threads to send to a single socket more efficiently commit
ipv4: Automatically bring up DSA master network devices, which allows DSA slave network devices to be used as valid interfaces for, e.g., NFS root booting, by allowing kernel IP auto-configuration to succeed on these interfaces commit
ipv6: Add a sysctl entry (accept_ra_mtu) to disable MTU updates from router advertisements commit
vxlan: Implement support for the Group Policy VXLAN extension to provide a lightweight and simple security label mechanism across network peers based on VXLAN. It allows further mapping to a SELinux context using SECMARK, to implement ACLs directly with nftables, iptables, OVS, tc, etc commit
vxlan: Add support for remote checksum offload in VXLAN. It is described here. commit
net: openvswitch: Support masked set actions. commit
Infiniband: Add support for extensible query device capabilities verb to allow adding new features commit
Layer 2 Tunneling Protocol (l2tp): multicast notification to the registered listeners when the tunnels/sessions are created/modified/deleted commit
SUNRPC: Set SO_REUSEPORT socket option for TCP connections to bind multiple TCP connections to the same source address+port combination commit
tipc: involve namespace infrastructure commit
802.15.4: introduce support for cca settings commit
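The TCP "ACK loop" mitigation listed above rate-limits duplicate ACKs sent in response to out-of-window segments. The general idea of such a limiter can be sketched as below; the class name, the per-connection granularity and the interval parameter are assumptions for illustration and do not reproduce the kernel's implementation.

```python
class DupAckRateLimiter:
    """Allow at most one duplicate ACK per 'min_interval_ms' for a connection."""

    def __init__(self, min_interval_ms):
        self.min_interval_ms = min_interval_ms
        self.last_sent_ms = None

    def allow(self, now_ms):
        """Return True if a duplicate ACK may be sent at time 'now_ms'."""
        if (self.last_sent_ms is not None
                and now_ms - self.last_sent_ms < self.min_interval_ms):
            return False  # too soon after the previous duplicate ACK: suppress it
        self.last_sent_ms = now_ms
        return True
```

With such a limiter on both endpoints, two peers that keep answering each other's out-of-window segments can no longer exchange ACKs at line rate; the loop is throttled to one ACK per interval.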
- Wireless
Add new GCMP, GCMP-256, CCMP-256, BIP-GMAC-128, BIP-GMAC-256, and BIP-CMAC-256 cipher suites. These new cipher suites were defined in IEEE Std 802.11ac-2013 commit, commit, commit, commit, commit
New NL80211_ATTR_NETNS_FD attribute, which allows setting the network namespace via nl80211 by fd commit
Support per-TID station statistics commit
Allow usermode to query wiphy specific regdom commit
- bridge
- Near Field Communication (NFC)
HCI over NCI protocol support (Some secure elements only understand HCI and thus we need to send them HCI frames) commit
NCI NFCEE (NFC Execution Environment, typically an embedded or external secure element) discovery and enabling/disabling support commit, commit, commit, commit, commit, commit, commit
NFC_EVT_TRANSACTION userspace API addition, it is sent through netlink in order for a specific application running on a secure element to notify userspace of an event commit
Tx timestamps are looped onto the error queue on top of an skb. This mechanism leaks packet headers to processes unless the no-payload option SOF_TIMESTAMPING_OPT_TSONLY is set. A new sysctl (tstamp_allow_data) optionally drops looped timestamps with data. This only affects processes without CAP_NET_RAW commit, commit, commit
- Bluetooth
tc: add BPF-based action. This action provides a possibility to execute custom BPF code commit
net: sched: Introduce connmark action commit
Add Transparent Ethernet Bridging GRO support commit
netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads commit
netfilter: nft_compat: add ebtables support commit
openvswitch: Add support for checksums on UDP tunnels. commit
openvswitch: Support VXLAN Group Policy extension commit
12. List of merges