Linux 4.4 was released on Sun, 10 Jan 2016.
Summary: This release adds 3D support in the virtual GPU driver, which allows 3D hardware-accelerated graphics in virtualization guests; loop device support for Direct I/O and Asynchronous I/O, which saves memory and increases performance; support for Open-channel SSDs, which are devices that share the responsibility of the Flash Translation Layer with the operating system; completely lockless TCP listener handling, which allows for faster and more scalable TCP servers; journalled RAID5 in the MD layer, which fixes the RAID write hole; eBPF programs that can now be run by unprivileged users and can be made persistent, plus perf support for eBPF programs; a new mlock2() syscall that allows users to request memory to be locked on page fault; and block polling support for improved performance in high-end storage devices. There are also new drivers and many other small improvements.
Contents
Prominent features
- Faster and leaner loop device with Direct I/O and Asynchronous I/O support
- 3D support in virtual GPU driver
- LightNVM adds support for Open-Channel SSDs
- TCP listener handling completely lockless, making TCP servers faster and more scalable
- Preliminary journalled RAID5 MD support
- Unprivileged eBPF + persistent eBPF programs
- perf + eBPF integration
- Block polling support
- mlock2() syscall allows users to request memory to be locked on page fault
- Drivers and architectures
- Core (various)
- File systems
- Memory management
- Block layer
- Cryptography
- Security
- Tracing and perf tool
- Virtualization
- Networking
- List of merges
- Other news sites
1. Prominent features
1.1. Faster and leaner loop device with Direct I/O and Asynchronous I/O support
This release introduces support for Direct I/O and asynchronous I/O in the loop block device. Using Direct I/O and AIO to read and write the loop device's backing file has several advantages: the double cache is avoided thanks to Direct I/O, which reduces memory usage a lot; unlike user space Direct I/O, there is no cost of pinning pages; and context switches are avoided in some cases because concurrent submissions can be avoided. See the commits for benchmarks.
Code: commit, commit, commit, commit, commit
1.2. 3D support in virtual GPU driver
virtio-gpu is a driver for virtualization guests that allows them to use the host graphics card efficiently. In this release, it allows the virtualization guest to use the capabilities of the host GPU to accelerate 3D rendering. In practice, this means that a virtualized Linux guest can run an OpenGL game while using the GPU acceleration capabilities of the host, as shown in this or this video. This also requires running QEMU 2.5.
44m linux.conf talk about the project
Code: commit
1.3. LightNVM adds support for Open-Channel SSDs
Open-channel SSDs are devices that share responsibilities with the operating system in order to implement and maintain features that typical SSDs keep strictly in firmware. These include the Flash Translation Layer (FTL), bad block management, and hardware units such as the flash controller, the interface controller, and large amounts of flash chips. In this way, Open-channel SSDs expose direct access to their physical flash storage, while keeping a subset of the internal features of SSDs.
LightNVM is a specification that gives support to Open-channel SSDs. LightNVM allows the host to manage data placement, garbage collection, and parallelism. Device-specific responsibilities such as bad block management, FTL extensions to support atomic IOs, or metadata persistence are still handled by the device. This Linux release adds support for LightNVM (and adds LightNVM support to NVMe as well).
Recommended LWN article: Taking control of SSDs with LightNVM
Code: commit, commit, commit, commit, commit
1.4. TCP listener handling completely lockless, making TCP servers faster and more scalable
In this release, as the result of an effort that started two years ago, the TCP implementation has been refactored to make the TCP listener fast path completely lockless. During tests, a server was able to process 3,500,000 SYN packets per second on one listener and still have available CPU cycles, about two to three orders of magnitude more than what was possible before. SO_REUSEPORT has also been extended (see the Networking section) to add proper CPU/NUMA affinities, so that heavy duty TCP servers can get proper siloing thanks to multi-queue NICs.
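The one-listener-per-CPU layout mentioned above can be sketched from user space as follows. This is a minimal illustration, not a tuned server: the helper name `make_listeners` is hypothetical, and the numeric fallback for SO_INCOMING_CPU (49) is an assumption that holds for the generic Linux socket headers but not for every architecture.

```python
import socket

# Python may not export SO_INCOMING_CPU; 49 is the value in the generic
# Linux socket headers (an assumption, some architectures differ).
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)


def make_listeners(count, host="127.0.0.1"):
    """Create `count` TCP listeners that share one port via SO_REUSEPORT."""
    listeners = []
    port = 0  # let the kernel pick a free port for the first listener
    for cpu in range(count):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        try:
            # Hint that this listener should serve packets handled on `cpu`.
            sock.setsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU, cpu)
        except OSError:
            pass  # older kernels do not accept setting this option
        sock.bind((host, port))
        sock.listen(128)
        port = sock.getsockname()[1]  # later listeners reuse the same port
        listeners.append(sock)
    return listeners
```

A real server would typically run one accepting thread or process per listener, pinned to the matching CPU, so that each flow stays on the CPU that handled its packets.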
1.5. Preliminary journalled RAID5 MD support
This release adds journalled RAID 5 support to the MD (RAID/LVM) layer. With a journal device configured (typically NVRAM or SSD), data and parity are first written to the log and then to the RAID array disks. If a crash happens, the data can be recovered from the log. This can speed up RAID resync and fixes the RAID5 write hole issue: a crash during degraded operations cannot result in data corruption. In future releases the journal will also be used to improve performance and latency.
Code: merge
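To make the write hole concrete, here is a toy model of a single stripe with XOR parity. It is only a sketch of the log-then-write idea described above, not the MD implementation; the `Stripe` class and its `crash_before_parity` switch are hypothetical names introduced for the illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class Stripe:
    """One stripe: data blocks plus a parity block, with an optional journal."""

    def __init__(self, data_blocks, journaled):
        self.data = list(data_blocks)
        self.parity = xor_blocks(self.data)
        self.journal = [] if journaled else None

    def write(self, index, block, crash_before_parity=False):
        new_data = self.data[:index] + [block] + self.data[index + 1:]
        new_parity = xor_blocks(new_data)
        if self.journal is not None:
            # Step 1: log data and parity before touching the array disks.
            self.journal.append((index, block, new_parity))
        # Step 2: write to the array; a crash here leaves stale parity.
        self.data[index] = block
        if crash_before_parity:
            return
        self.parity = new_parity
        if self.journal is not None:
            self.journal.clear()  # stripe is consistent, drop the log entry

    def recover(self):
        """Replay any logged writes after a crash."""
        for index, block, parity in self.journal or []:
            self.data[index] = block
            self.parity = parity
        if self.journal is not None:
            self.journal.clear()

    def consistent(self):
        return self.parity == xor_blocks(self.data)
```

Without the journal, a crash between the data write and the parity write leaves the stripe inconsistent with nothing to repair it from; with the journal, `recover()` replays the logged data and parity and the stripe is consistent again.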
1.6. Unprivileged eBPF + persistent eBPF programs
Unprivileged eBPF
eBPF programs got their own syscall in Linux 3.18, but until now its use had been restricted to root, because these programs were dangerous for security. eBPF programs are, however, validated by the kernel, and in this release the eBPF verifier has been improved so that unprivileged users can use it (although unprivileged eBPF is only meaningful for 'socket filter'-like programs; eBPF programs for tracing and TC classifiers/actions will stay root only). This feature can be switched off with the sysctl kernel.unprivileged_bpf_disabled (once true, bpf programs and maps cannot be accessed from unprivileged processes, and the toggle cannot be set back to false).
Recommended LWN article: Unprivileged bpf()
Persistent eBPF maps/progs
This release also adds support for "persistent" eBPF maps/programs. "Persistent" here means that maps/programs have a facility that lets them survive process termination. This is desired by various eBPF subsystem users, for example the tc classifier/action. Whenever tc parses the ELF object, extracts and loads maps/progs into the kernel, these file descriptors will be out of reach after the tc instance exits, so a subsequent tc invocation won't be able to access/relocate on this resource, and therefore maps cannot easily be shared, for example between the ingress and egress networking data path.
To fix issues such as these, a new minimal file system has been created that can hold map/prog objects at /sys/fs/bpf/. Any subsequent mounts within a given namespace will point to the same instance. The file system allows for creating a user-defined directory structure. The objects for maps/progs are created/fetched through bpf(2) along with a pathname with two new commands (BPF_OBJ_PIN/BPF_OBJ_GET), which in turn create the file system nodes. The user can use that to access maps and progs later on, through bpf(2).
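The two commands can be reached from a script through the raw syscall. The following is a rough sketch under several assumptions: the helper names `pin_object` and `get_object` are hypothetical, the bpf(2) syscall number is filled in for x86_64 only, and the attribute layout is reduced to the two fields these commands use (pathname and file descriptor). Actually pinning an object requires a map or program file descriptor obtained elsewhere and a mounted bpf file system.

```python
import ctypes
import os
import platform

# Command numbers from the bpf_cmd enum in the kernel's uapi bpf header.
BPF_OBJ_PIN = 6
BPF_OBJ_GET = 7
# Syscall number of bpf(2); only the x86_64 value is filled in here.
NR_BPF = {"x86_64": 321}


class BpfObjAttr(ctypes.Structure):
    """The part of union bpf_attr used by the BPF_OBJ_* commands."""
    _fields_ = [("pathname", ctypes.c_uint64), ("bpf_fd", ctypes.c_uint32)]


def _bpf_obj(cmd, path, fd=0):
    nr = NR_BPF.get(platform.machine())
    if nr is None:
        raise NotImplementedError("bpf(2) syscall number not known here")
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    libc.syscall.argtypes = [ctypes.c_long, ctypes.c_long,
                             ctypes.c_void_p, ctypes.c_long]
    path_buf = ctypes.create_string_buffer(os.fsencode(path))
    attr = BpfObjAttr(pathname=ctypes.addressof(path_buf), bpf_fd=fd)
    ret = libc.syscall(nr, cmd, ctypes.addressof(attr), ctypes.sizeof(attr))
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    return ret


def pin_object(fd, path):
    """Pin an existing map/program fd at `path` inside a bpf file system."""
    _bpf_obj(BPF_OBJ_PIN, path, fd)


def get_object(path):
    """Return a new file descriptor for the object pinned at `path`."""
    return _bpf_obj(BPF_OBJ_GET, path)
```

With this, one process could call `pin_object(map_fd, "/sys/fs/bpf/my_map")` (an illustrative path) and a later process could call `get_object` on the same path to reach the same map.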
1.7. perf + eBPF integration
In this release, eBPF programs have been integrated with perf. When perf is given an eBPF .c source file (or a .o file built for the 'bpf' target with clang), the program is automatically built, validated and loaded into the kernel, and it can then be used and seen using perf trace and other tools.
Users can use a BPF filter like this: # perf record --event ./hello_world.o ls. The eBPF program is attached to a newly created perf event, which works with all tools.
Code: commit, commit, commit, commit, commit, commit, commit, commit, commit, commit
1.8. Block polling support
This release adds basic support for polling for a specific I/O to complete, which can improve latency and throughput in very fast devices. Currently, O_DIRECT sync reads and writes are supported. This support is only intended for testing; in future releases, stats tracking will be used to auto-tune it. For now, for benchmark and testing purposes, a sysfs file (io_poll) controls whether polling is enabled or not.
Recommended LWN article: Block-layer I/O polling
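For benchmarking scripts, the io_poll knob can be read and toggled like any other sysfs file. The sketch below assumes the file lives in the per-device request queue directory (/sys/block/<dev>/queue/io_poll); the helper names and the `sysfs_root` parameter are hypothetical and exist only to keep the example testable.

```python
from pathlib import Path


def io_poll_path(device, sysfs_root="/sys"):
    # Assumed location of the knob: the device's request queue directory.
    return Path(sysfs_root) / "block" / device / "queue" / "io_poll"


def get_io_poll(device, sysfs_root="/sys"):
    """Return True/False for the polling state, or None if the file is absent."""
    try:
        return io_poll_path(device, sysfs_root).read_text().strip() == "1"
    except (FileNotFoundError, NotADirectoryError):
        return None


def set_io_poll(device, enabled, sysfs_root="/sys"):
    """Enable or disable polling; on a real system this usually needs root."""
    io_poll_path(device, sysfs_root).write_text("1" if enabled else "0")
```

A benchmark could call `set_io_poll("nvme0n1", True)` (the device name is just an example), run its O_DIRECT workload, and then restore the previous value.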
1.9. mlock2() syscall allows users to request memory to be locked on page fault
mlock() allows a user to control the paging out of program memory, but this comes at the cost of faulting in the entire mapping when it is allocated. For large mappings this is not ideal. For example, security applications that need mlock() are forced to lock an entire buffer, no matter how big it is. Applications that work on large graphical models, where the path through the graph is not known until run time, are forced to lock the entire graph or lock it page by page as pages are faulted in.
This new mlock2() syscall creates a middle ground. Pages are marked to be placed on the unevictable LRU (locked) when they are first used, but they are not faulted in by the mlock call. The new system call takes a flags argument along with the start address and size. This flags argument gives the caller the ability to request that memory be locked in the traditional way, or be locked after the page is faulted in. New calls are added for munlock() and munlockall() which give the caller a way to specify which flags are supposed to be cleared. A new MCL flag is added to mirror the lock on fault behavior from mlock() in mlockall(). Finally, a flag for mmap() is added that allows a user to specify that the covered area should not be paged out, but only after the memory has been used the first time.
Recommended LWN article: Deferred memory locking
Code: commit, commit, commit, commit
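As an illustration of the lock-on-fault mode, the following sketch calls mlock2() on an anonymous mapping through the C library. It assumes the C library exports an mlock2() wrapper (not all versions do) and that the MLOCK_ONFAULT flag has the value 0x01 from the kernel headers; the helper name `lock_on_fault` is hypothetical.

```python
import ctypes
import mmap
import os

MLOCK_ONFAULT = 0x01  # flag value from the kernel's uapi mman headers


def lock_on_fault(buf):
    """Mark an anonymous mmap buffer as lock-on-fault.

    Returns True on success, False if the kernel refuses, and None when
    the C library has no mlock2() wrapper.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    try:
        mlock2 = libc.mlock2
    except AttributeError:
        return None
    mlock2.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_uint]
    mlock2.restype = ctypes.c_int
    view = ctypes.c_char.from_buffer(buf)
    try:
        ret = mlock2(ctypes.addressof(view), len(buf), MLOCK_ONFAULT)
    finally:
        del view  # release the exported buffer so the mmap can be closed
    if ret != 0:
        print("mlock2 failed:", os.strerror(ctypes.get_errno()))
        return False
    return True


if __name__ == "__main__":
    buf = mmap.mmap(-1, mmap.PAGESIZE)
    print(lock_on_fault(buf))
    buf[0] = 1  # the first touch faults the page in; it is locked from then on
    buf.close()
```

Unlike a plain mlock(), the call itself does not fault the mapping in, so its cost does not depend on the size of the buffer.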
2. Drivers and architectures
All the driver and architecture-specific changes can be found in the Linux_4.4-DriversArch page
3. Core (various)
process scheduler: Apply a frequency scaling correction factor to per-entity load tracking to make it invariant with respect to CPU frequency. Currently, load appears bigger when the CPU is running at slower frequencies, which affects load-balancing decisions commit, commit
seccomp: add support for dumping a process' (classic BPF) seccomp filters via ptrace + PTRACE_SECCOMP_GET_FILTER commit
watchdog: Mimic the softlockup_panic kernel knob and create a /proc/sys/kernel/hardlockup_panic. It enables a hardlockup to panic the machine commit
watchdog: optionally perform all-CPU backtrace in case of hard lockup. Can be enabled with sysctl /proc/sys/kernel/hardlockup_all_cpu_backtrace commit
coredump: Add two new flags to the existing coredump mechanism for ELF and FDPIC ELF files to allow us to explicitly filter DAX mappings. This is desirable because DAX mappings, like hugetlb mappings, have the potential to be very large commit, commit
test_printf: test printf family at runtime commit
Make sync_file_range(2) use WB_SYNC_NONE writeback. It helps PostgreSQL avoid large latency spikes when flushing data in the background commit
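The frequency-invariance item in the list above boils down to scaling the time a task spent running by the ratio of the current to the maximum CPU frequency. The sketch below uses the scheduler's fixed-point convention in which 1024 means full speed; the function names are hypothetical and the code shows only the arithmetic, not the kernel's load tracking itself.

```python
SCHED_CAPACITY_SHIFT = 10
SCHED_CAPACITY_SCALE = 1 << SCHED_CAPACITY_SHIFT  # 1024 means full speed


def freq_scale(cur_khz, max_khz):
    """Current frequency as a fixed-point fraction of the maximum."""
    return (cur_khz << SCHED_CAPACITY_SHIFT) // max_khz


def scaled_runtime(delta_us, cur_khz, max_khz):
    """Running time corrected for the frequency the CPU was running at."""
    return (delta_us * freq_scale(cur_khz, max_khz)) >> SCHED_CAPACITY_SHIFT
```

A task that needs 1000 us at half speed and 500 us at full speed contributes the same scaled amount in both cases, so its load no longer looks bigger just because the CPU was clocked down.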
4. File systems
- XFS
- Btrfs
- CIFS
Allow duplicate extents (cp --reflink) in SMB3.0 not just SMB3.1.1 commit
Add resilienthandles mount parameter. Since many servers (Windows clients, and non-clustered servers) do not support persistent handles but do support resilient handles, allow the user to specify a mount option "resilienthandles" in order to get more reliable connections and less chance of data loss (at least when SMB2.1 or later). Default resilient handle timeout (120 seconds to recent Windows server) is used commit
Add support for persistent handles, which are like durable file handles with strong guarantees commit, commit, commit
Allow copy offload (copychunk) across shares commit
- NFS
- ext4
Store checksum seed in superblock commit
- OCFS2
Improve performance for localalloc commit
- UBIFS
atime support commit
5. Memory management
Get rid of vmalloc_info from /proc/meminfo. It is too expensive to calculate and shows up in real workloads; people who actually want to know what the situation is with respect to the vmalloc area should just look at the much more complete /proc/vmallocinfo instead commit
Add HugetlbPages field to /proc/PID/status. Currently there's no easy way to get per-process usage of hugetlb pages, which is inconvenient because userspace applications which use hugetlb may need it commit
Add hugetlb-related fields to /proc/PID/smaps to know per-task or per-vma base hugetlb usage: AnonHugePages shows the amount of memory backed by transparent hugepage; Shared_Hugetlb and Private_Hugetlb show the amounts of memory backed by hugetlbfs page which is not counted in RSS or PSS field for historical reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field commit
memcontrol: eliminate memory.current on the root level, because it doesn't add anything that wouldn't be more accurate and detailed using system statistics commit
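The new HugetlbPages field in /proc/PID/status, listed above, can be read like the other fields of that file. This is a small sketch that assumes the value is reported as a number followed by a "kB" unit, as the other memory fields in that file are; both function names are hypothetical.

```python
def hugetlb_kib(status_text):
    """Return the HugetlbPages value in kB from status text, or None."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "HugetlbPages":
            return int(value.split()[0])
    return None  # field missing, for example on kernels older than 4.4


def read_hugetlb_kib(pid="self"):
    """Read the field for a given PID (defaults to the calling process)."""
    with open(f"/proc/{pid}/status") as f:
        return hugetlb_kib(f.read())
```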
6. Block layer
loop: direct and asynchronous I/O commit, commit, commit, commit, commit
Add Persistent Reservations support. It includes a user space interface for simplified Persistent Reservations which map to block devices that support these (only SCSI for now). Persistent Reservations allow restricting access to block devices to specific initiators in a shared storage setup commit, commit, commit
Export integrity data interval size in /sys/block/<disk>/integrity/protection_interval_bytes, so that apps can tell whether the interval is different from the device's logical block size commit
cdrom: Random writing support for BD-RE media commit
7. Cryptography
crypto: caam - add support for acipher xts(aes) commit
crypto: keywrap - add key wrapping block chaining mode commit
crypto: qat - add support for ctr(aes) and xts(aes) commit
8. Security
9. Tracing and perf tool
Integration of perf with eBPF: given an eBPF .c source file (or a .o file built for the 'bpf' target with clang), perf gets it automatically built, validated and loaded into the kernel via the sys_bpf syscall, and it can then be used and seen using 'perf trace' and other tools. Users can run commands like perf record --event bpf-file.c ls to try it commit, commit, commit, commit, commit, commit, commit, commit, commit, commit
Add a new branch type sampling filter to perf record, named 'call' (perf record -j call -e cycles .....), that samples only call branches (function calls), unlike 'any_call' that included direct, indirect calls and far jumps. Only x86 and PowerPC are supported in this release commit, commit
Add Intel cstate (aka idle states) Performance Monitoring Unit support. This allows perf to support cstate related free running (read-only and system-wide) counters. For example, to calculate the fraction of time when the core is running in C6 state: perf stat -x, -e"cstate_core/c6-residency/,msr/tsc/" -C0 -- taskset -c 0 sleep 5 commit
CPU socket filtering: perf tools introduce a new sort type "socket" for the processor socket, e.g. perf report --stdio --sort socket,comm,dso,symbol commit. Also, perf report introduces a --socket-filter option to only show entries for a processor socket that match this filter commit. perf hists browser can zoom in/out for processor socket commit
perf tools: Introduce the 'P' modifier, which causes the event to use the maximum possible detected precise level. For example, perf record -e cycles:P ... will detect the maximum precise level for the 'cycles' event and use it commit
perf tools: Add support for sorting on the iaddr. New sort option is: symbol_iaddr, header label is 'Code Symbol', eg perf mem report --stdio -F +symbol_iaddr commit
perf tools: enables config terms for tracepoint perf events. Valid terms for tracepoint events are 'call-graph' and 'stack-size', so different callgraph settings can be used for each event and eliminate unnecessary overhead. An example for using different call-graph config for each tracepoint: perf record -e syscalls:sys_enter_write/call-graph=fp -e syscalls:sys_exit_write/call-graph=no dd if=/dev/zero of=test bs=4k count=10 commit
perf script: Enable printing of branch stack via the 'brstack' and 'brstacksym' arguments to the field selection option -F. The option is off by default and operates only if the perf.data file has branch stack content commit
perf auxtrace: Add AUX area tracing option 'l' to synthesize branch stacks on samples just like sample type PERF_SAMPLE_BRANCH_STACK commit
perf hists browser: Add 'm' key for context menu display commit
perf inject: Add --strip option which is used with --itrace to strip out non-synthesized events commit
perf script: Allow time to be displayed in nanoseconds commit
Intel PT hardware tracer: Accept a zero --itrace period, meaning "as often as possible". In the case of Intel PT that is the same as a period of 1 and a unit of 'instructions' (i.e. --itrace=i1i) commit
Intel PT: Add support for generating branch stack context for PT samples. This is useful for: reporting accurate basic block edge frequencies through the perf report branch view or using with --branch-history to get the wider context of samples. Examples, record with Intel PT: perf record -e intel_pt//u ls
ftrace: add module globbing commit
10. Virtualization
Support for VT-d posted interrupts (i.e. PCI devices can inject interrupts directly into vCPUs). Used by KVM and VFIO commit
KVM: Nested virtualization now supports VPID (same as PCID but for vCPUs) which makes it quite a bit faster commit, commit, commit
KVM: Support for "split irqchip", i.e. LAPIC in kernel and IOAPIC/PIC/PIT in userspace, which reduces the attack surface of the hypervisor commit, commit, commit
KVM: add capability for any-length ioeventfds. With KVM_CAP_IOEVENTFD_ANY_LENGTH, a zero length ioeventfd is allowed, and the kernel will ignore the length of guest write and may get a faster vmexit commit
VMware balloon: Get notified immediately via VMCI when a balloon target is set, instead of waiting for up to one second commit
VMware balloon: Support ballooning with 2 MB sized pages. It significantly reduces the hypervisor side (and guest side) overhead of ballooning and unballooning commit
VMware vmxnet3: Extend register dump support commit
11. Networking
Add setsockopt() support for SO_INCOMING_CPU and extend the SO_REUSEPORT selection logic: if a TCP listener or UDP socket has this option set, a packet is delivered to this socket only if the CPU handling the packet matches the specified one. This allows building very efficient TCP servers, using one listener per RX queue, as the associated TCP listener should only accept flows handled in softirq by the same CPU. This provides optimal NUMA behavior and keeps CPU caches hot commit, commit
TCP: Recent ACK (RACK) loss recovery. RACK loss recovery uses the notion of time instead of packet sequence (FACK) or counts (dupthresh) (see commit for details). In the current patch set RACK is only a supplemental loss detection and does not trigger fast recovery. However RACK is being developed to replace or consolidate FACK/dupthresh, early retransmit, and thin-dupack. Since RACK is still experimental, it is now used as a supplemental loss detection on top of existing algorithms. It can be disabled with sysctl net.ipv4.tcp_recovery commit
IP Virtual Server: Support scheduling of ICMP packets to IPVS instances. A new sysctl net.ipv4.vs.schedule_icmp has been introduced that enables this feature if set to 1 (it is set to 0 by default to retain the old behaviour) merge commit
IP Virtual Server: Allow ignoring tunnelled packets with the new sysctl net.ipv4.vs.ignore_tunneled. If set, ipvs will set the ipvs_property on all packets which are of unrecognised protocols. This prevents the kernel from routing tunnelled protocols like ipip, which is useful to prevent rescheduling packets that have been tunneled to the ipvs host (i.e. to prevent ipvs routing loops when ipvs is also acting as a real server) commit
Provide FIB table ID in ipv4 route dumps just as ipv6 does commit
IPv4: Hash-based multipath routing. When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed from more or less being destination-based into being quasi-random per-packet scheduling. This increased the risk of out-of-order packets and made it impossible to use multipath together with anycast services. In this release, the multipath routing implementation is replaced with a flow-based load balancing based on a hash over the source and destination addresses merge commit
IPv6 support to the Virtual Routing and Forwarding (VRF) devices commit, commit, commit
IPv4: Currently, adding a new IPv4 address always causes the creation of the related network route, with the default metric. Add support for IFA_F_NOPREFIXROUTE for IPv4 addresses. When an address is added with this flag set, no associated network route is created, no network route is deleted when that IP is gone, and it is up to user space to manage such a route commit
IPv6: gro: support sit protocol commit
Allow the user to ask for the statistics to be filtered out of ipv4/ipv6 address netlink dumps, because many commonly used functions like getifaddrs() invoke RTM_GETLINK to dump the interface information, and do not need the AF_INET6 statistics, which are expensive to calculate commit
bridge: Allow setting the bridge attribute ageing_time in rocker and switchdev commit, commit, commit
vxlan: support both IPv4 and IPv6 sockets in a single vxlan device commit
bridge: complete the bridge device's netlink support and make it possible to view and configure everything that can be configured via sysfs commit
bridge: Enable adding fdb entries pointing to the bridge device. This can be used to propagate mac address of vlan interfaces configured on top of the vlan filtering bridge commit
Multi Protocol Label Switching (MPLS): Add support for multipath routes commit, commit
bonding: support encapsulated ipv6 TSO commit
Add support for filtering neighbor dumps by master device by adding the NDA_MASTER attribute to the dump request. A new netlink flag, NLM_F_DUMP_FILTERED, is added to indicate the kernel supports the request and output is filtered as requested commit
Add support for filtering neighbor dumps by device by adding the NDA_IFINDEX attribute to the dump request commit
Support for disabling certain features on devices which, when disabled on an upper device, such as a bonding master or a bridge, must be disabled and cannot be re-enabled on underlying devices commit
Introduce L3 Master device abstraction support. It provides glue between core networking code and device drivers to support L3 master devices like VRF commit
dummy: add more features commit
tso: add support for IPv6 commit
netfilter: nfnetlink_log: allow including the conntrack information together with the packet that is sent to user space via NFLOG, so that a user-space program can acquire NATed information from the NFULA_CT attribute commit
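The hash-based multipath routing item in the list above can be pictured with a few lines of code: every packet of a flow hashes to the same next hop, so packets are not reordered across paths. This is a simplified sketch, not the kernel algorithm. CRC32 stands in for the kernel's hash function, next-hop weights are ignored, and the function name `select_nexthop` is hypothetical.

```python
import ipaddress
import zlib


def select_nexthop(src, dst, nexthops):
    """Pick a next hop from a hash over the source and destination addresses."""
    key = ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed
    # CRC32 is a stand-in here; the kernel's own hash is not reproduced.
    return nexthops[zlib.crc32(key) % len(nexthops)]
```

Calling `select_nexthop("192.0.2.1", "198.51.100.7", hops)` repeatedly returns the same entry of `hops` every time, which is the property that the quasi-random per-packet scheduling lacked.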
- Wireless
Allow changing station capabilities for unassociated stations commit
Implement Very High Throughput support for mesh networks commit
Make CRDA support optional commit
Advertise support for full station state in AP mode commit
Put current TX power in interface info replies commit
Enable wiphy device to suspend/resume asynchronously commit
ieee802154: experimental netlink support commit
ieee802154: 6lowpan: add tx/rx stats commit
ipconfig: Allow sending a Client-identifier in DHCP requests with something like ip=dhcp,client_id_type, client_id_value, as a kernel parameter to enable the kernel to identify itself to the server commit
Add netlink directives and ndo entry to trust VF user. This controls the special permission of the VF user. The administrator can explicitly trust a VF user to use some features which impact security and/or performance commit
IB: Add support of checksum capability reporting for RC and RAW commit
IB: Add support for network namespaces commit, commit, commit
openvswitch: Add netlink attributes for IPv6 tunnel addresses. This enables IPv6 support for tunnels commit
TIPC: introduce jumbo frame support for broadcast commit
xprtrdma: Enable swap-on-NFS/RDMA commit
12. List of merges
13. Other news sites
LWN merge window part 1 and part 2