Linux 3.4 was released on 20 May 2012.
Summary: This release includes several Btrfs updates: support for metadata blocks bigger than 4KB, much improved metadata performance, better error handling and better recovery tools. There is also a new X32 ABI, which allows programs to run in 64-bit mode with 32-bit pointers, and several updates to the GPU drivers: early modesetting support for Nvidia GeForce 600 'Kepler', support for AMD Radeon 7xxx and AMD Trinity APU series, and support for Intel Medfield graphics. Other additions are x86 CPU driver autoprobing, a device-mapper target that stores cryptographic hashes of blocks to check for intrusions, another target that uses an external read-only device as the origin source of a thin provisioned LVM volume, several perf improvements such as a GTK2 report GUI, and a new 'Yama' security module. Many small features, new drivers and fixes are also available.
Contents
Prominent features in Linux 3.4
- Btrfs updates
- GPU drivers
- New X32 ABI: 64-bit mode with 32-bit pointers
- x86 CPU driver autoprobing
- Verifiable boot path with the device mapper "verity" target
- Support an external read-only device as origin source of a thin provisioned LVM volume
- perf: GTK2 report GUI, better assembly visualization, branch profiling, filtering of users and threads
- 'Yama' security module
- QNX6 filesystem
- Driver and architecture-specific changes
- File systems
- Various core changes
- Memory management
- Networking
- Virtualization
- Crypto
- Security
- Block
- Perf profiling
1. Prominent features in Linux 3.4
1.1. Btrfs updates
This release has many Btrfs updates. Recommended video from Chris Mason, "Btrfs status and new features" (video file in webm, and h.264 format)
1.1.1. Btrfs: Repair and data recovery tools
A new data recovery tool (btrfs-restore) is available. This program doesn't attempt to repair the filesystem; it only tries to pull files from damaged filesystems and copy them to a safe location. Also, the Btrfs filesystem checker (aka fsck) can now repair extent allocation tree corruptions (more repair modes are in progress).
1.1.2. Btrfs: Metadata blocks bigger than 4KB
Btrfs was designed from the start to support blocks of multiple sizes, but the code wasn't ready and was disabled, so Btrfs filesystems used the size of a memory page (4KB on x86) as the block size. In this version, support for metadata blocks bigger than the page size has been re-enabled, so Btrfs can use metadata blocks of up to 64KB in size (16/32KB seem to work better and are recommended). Support is enabled at mkfs time (e.g. mkfs.btrfs -l 32K). These block sizes cut down the size of the extent allocation tree dramatically and fragment much less. Code: (commit 1, 2)
1.1.3. Btrfs: Performance improvements
Btrfs performance has improved in several areas. The bigger metadata blocks alone give Btrfs a performance gain, as the extent allocation tree overhead and metadata fragmentation are greatly reduced. But there are other performance improvements: the way Btrfs works with the Linux page cache has been reworked and it's now much faster, and CPU usage has been reduced. Also, the copy-on-write mechanisms didn't play well with the Linux VM and forced Btrfs to do many more reads than it should; further tuning has been done to prevent that.
As a result of these improvements, metadata workloads are much faster. In a benchmark consisting of creating 32 million empty files, Btrfs created 170,000 files per second, whereas ext4 and XFS created 110,000 files/second and 115,000 files/second respectively. I/O graphs comparing Btrfs performance in 3.3 and performance in 3.4. Code: (commit), (commit), (commit), (commit), (commit)
1.1.4. Btrfs: Better error handling
Many parts of the Btrfs codebase weren't reliable. This was not because data could be harmed (the filesystem is designed to always keep the data safe), but because many functions didn't handle unexpected conditions and instead just stopped the system with a panic. In this version, Btrfs has been audited to handle these situations correctly: when one of those unexpected errors happens, the current transaction is aborted, errors are returned to the userspace callers, and the filesystem switches to read-only mode, as is the tradition in Linux. Code: (commit).
1.2. GPU drivers
1.2.1. GPU: Early support of Nvidia GeForce 600 'Kepler'
Nvidia announced the new Kepler GPUs (GeForce 600 Series) on 22 March, and on that same day the Nouveau team asked to get basic modesetting support (no 3D, etc) for them merged into the main kernel. A quote from a Nouveau developer: "Its quite amazing that nouveau can support a GPU on its launch day even if its just unaccelerated modesetting". External firmware and an updated graphics software stack are required. Code: (commit)
The Nouveau driver has also been "unstaged" and now it's considered ready for widespread use.
1.2.2. GPU: Support for AMD Radeon 7xxx and Trinity APU series
The newest GPUs and APUs from AMD (Radeon 7xxx and Trinity APU series) are supported in this version. Code: (commit)
1.2.3. GPU: Support of Intel Medfield graphics
This release adds experimental support for the GMA500 Medfield graphics. Medfield is an embedded architecture targeted at smartphones. Code: (commit)
1.3. New X32 ABI: 64-bit mode with 32-bit pointers
The 64-bit mode of x86 CPUs enlarges the CPU registers to 64 bits, making it possible to address larger (>4GB) amounts of memory. This widening, however, has a drawback. Because memory addresses are 64 bits wide, pointers occupy 64 bits of space, double the space used in 32-bit mode, so binaries compiled for the 64-bit mode are bigger, and when these programs run they use more RAM. And since they are bigger they can cause a performance loss, because with bigger memory addresses, fewer CPU instructions will fit in the CPU caches.
Some programs have workloads that are CPU and pointer intensive enough to care about this performance loss, but have memory requirements too small to need 64-bit memory addressing. They can avoid the 64-bit pointer overhead by just using the 32-bit mode: processors still allow running 32-bit operating systems, or running 32-bit programs on top of 64-bit kernels. But this choice also has problems. When a program runs in 32-bit mode, it loses all the other features of the 64-bit mode: larger number of CPU registers, better floating-point performance, faster PIC (position-independent code) shared libraries, function parameters passed via registers, faster syscall instruction...
So a new X32 kernel ABI has been created. A program compiled for this new ABI runs in 64-bit mode, with all its features, but uses 32-bit pointers and a 32-bit long C type. So applications that need it can enjoy the performance of the 64-bit mode, but with the memory requirements of a 32-bit ABI. Code: (commit)
Recommended LWN article: The x32 system call ABI
Slides from the developers: link
Official X32 coordination site: http://sites.google.com/site/x32abi
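The memory saving described above can be illustrated with a small layout calculation. This is only a model, not code built for the x32 ABI: the two hypothetical node structures below stand in pointers with 64-bit and 32-bit integers to show how pointer width changes the size of a pointer-heavy structure.

```python
import ctypes


class NodeLP64(ctypes.Structure):
    # Hypothetical list node; pointers modeled as 64-bit integers,
    # as in the regular x86-64 ABI.
    _fields_ = [
        ("next", ctypes.c_uint64),
        ("prev", ctypes.c_uint64),
        ("value", ctypes.c_int32),
    ]


class NodeX32(ctypes.Structure):
    # Same node with pointers modeled as 32-bit integers,
    # as they would be under the x32 ABI.
    _fields_ = [
        ("next", ctypes.c_uint32),
        ("prev", ctypes.c_uint32),
        ("value", ctypes.c_int32),
    ]


def node_sizes():
    """Return (size with 64-bit pointers, size with 32-bit pointers) in bytes."""
    return ctypes.sizeof(NodeLP64), ctypes.sizeof(NodeX32)


if __name__ == "__main__":
    wide, narrow = node_sizes()
    print("64-bit pointers: %d bytes, 32-bit pointers: %d bytes" % (wide, narrow))
```

On a typical 64-bit host the wide node is also padded to keep its 8-byte alignment, so the difference is larger than the raw pointer sizes alone suggest.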
1.4. x86 CPU driver autoprobing
There's a growing number of drivers that support a specific x86 feature or CPU. Currently, loading these drivers on a generic distribution requires various driver-specific hacks, and it often doesn't work. For example, a common issue is not loading the SSE 4.2 accelerated CRC module: this can significantly lower the performance of Btrfs, which relies on fast CRC. Another issue is loading the right CPUFREQ driver for the current CPU. Distributions often try all possible drivers until one sticks, which is not really a good way to do this.
Linux already has autoprobing mechanisms for drivers, based on kernel notifications and udev. In this release, Linux adds autoprobing support for CPU drivers, based on the x86 CPUID information, in particular on the vendor/family/model number and on the CPUID feature bits. Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9)
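The matching idea can be sketched as follows: the kernel describes the CPU in a single string, each driver declares a wildcard pattern, and userspace loads every driver whose pattern matches. The string layout, the feature number used for SSE 4.2 and the "hypothetical-cpufreq" driver are assumptions made for illustration, not a statement of the exact format the kernel uses.

```python
from fnmatch import fnmatchcase


def cpu_modalias(vendor, family, model, features):
    """Build an illustrative CPU description string (assumed layout)."""
    feats = "".join(",%04X" % f for f in sorted(features))
    return "x86cpu:vendor:%04X:family:%04X:model:%04X:feature:%s" % (
        vendor, family, model, feats)


# Driver name -> wildcard alias. Both patterns are assumptions for this sketch.
MODULE_ALIASES = {
    # Wants any CPU that advertises feature 0x0094 (assumed to be SSE 4.2).
    "crc32c-intel": "x86cpu:vendor:*:family:*:model:*:feature:*,0094*",
    # Hypothetical frequency driver bound to one vendor/family pair.
    "hypothetical-cpufreq": "x86cpu:vendor:0000:family:0006:model:*:feature:*",
}


def modules_for(modalias):
    """Return the sorted names of all drivers whose alias matches the CPU."""
    return sorted(name for name, pattern in MODULE_ALIASES.items()
                  if fnmatchcase(modalias, pattern))


if __name__ == "__main__":
    alias = cpu_modalias(0, 6, 0x2A, {0x01, 0x94})
    print(alias)
    print(modules_for(alias))
```

The point of the design is that no driver has to be tried blindly: a driver is loaded only when the CPU description matches its declared pattern.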
1.5. Verifiable boot path with the device mapper "verity" target
The device-mapper's "verity" target makes it possible to use a device to store cryptographic hashes of the blocks of a filesystem. This device can be used to check every read attempt to the filesystem, and if the hash of the block read doesn't match the stored hash, the read fails. This target is used by products such as Chrome OS and Netflix to ensure that the operating system isn't modified, and it can also be used to boot from a known-good device (like a USB drive or CD).
Recommended LWN article: dm-verity
Code: (commit)
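The check-on-read behaviour described in this section can be sketched in a few lines. This is a simplified model and not the dm-verity on-disk format: the real target organizes its hashes in a tree, while this sketch keeps a flat list, and the block size, the hash function and the function names are assumptions chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def build_hashes(data):
    """Compute one SHA-256 digest per block of the protected data."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(data), BLOCK_SIZE)]


def verified_read(data, hashes, block_index):
    """Return a block only if its digest matches the stored one."""
    start = block_index * BLOCK_SIZE
    block = data[start:start + BLOCK_SIZE]
    if hashlib.sha256(block).digest() != hashes[block_index]:
        # A mismatch makes the read fail instead of returning modified data.
        raise OSError("block %d failed verification" % block_index)
    return block


if __name__ == "__main__":
    image = bytes(range(256)) * 32          # two blocks of sample data
    stored = build_hashes(image)
    print(len(verified_read(image, stored, 0)))
```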
1.6. Support an external read-only device as origin source of a thin provisioned LVM volume
Device mapper supports thin provisioning (creation of filesystems larger than the total storage of the disks). Now, it also supports the use of an external read-only device as an origin for the thinly-provisioned volume. Any read to an unprovisioned area of the thin device will be passed through to the origin. Writes trigger the allocation of new blocks as usual.
One use case for this is VM hosts that want to run guests on thinly-provisioned volumes but have the base image on another device (possibly shared between many VMs).
Code: (commit)
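The read and write paths described in this section can be modeled with a toy class. This is only a sketch of the behaviour, with a hypothetical ThinVolume class and in-memory blocks, and says nothing about how the device-mapper target is implemented.

```python
class ThinVolume:
    """Toy model of a thin volume backed by a shared read-only origin."""

    def __init__(self, origin_blocks):
        self._origin = tuple(origin_blocks)  # never modified by the volume
        self._provisioned = {}               # block index -> private data

    def read(self, index):
        # Provisioned blocks are served locally; reads of unprovisioned
        # areas are passed through to the origin.
        if index in self._provisioned:
            return self._provisioned[index]
        return self._origin[index]

    def write(self, index, data):
        # A write allocates a private block; the origin stays untouched.
        self._provisioned[index] = data

    def provisioned_count(self):
        return len(self._provisioned)


if __name__ == "__main__":
    base_image = [b"boot", b"root", b"data"]
    guest_a = ThinVolume(base_image)
    guest_b = ThinVolume(base_image)
    guest_a.write(1, b"root-modified-by-a")
    print(guest_a.read(1), guest_b.read(1))
```

Two guests sharing one base image only consume private storage for the blocks they have actually written.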
1.7. perf: GTK2 report GUI, better assembly visualization, branch profiling, filtering of users and threads
- GTK2 report GUI: perf now has a simple GTK2-based 'perf report' browser. To launch "perf report" using the new GTK interface just type: "perf report --gtk". The interface is somewhat limited in features at the moment. Code: (commit)
- Better assembly visualization: 'perf annotate' has visual improvements for assembly junkies. It recognizes function calls in the TUI interface, and by hitting enter you can follow the call (recursively) and back, amongst other improvements.
- Hardware-based branch profiling: perf supports a new "hardware-based branch profiling" feature on CPUs that support it (modern x86 Intel CPUs with the 'LBR' hardware feature). This new feature is basically a sophisticated 'magnifying glass' for branch execution. The simplest mode is activated via 'perf record -b', for example "perf record -b any_call,u -e cycles:u branchy-command; perf report -b --sort=symbol". Code: (commit 1, 2, 3, 4, 5, 6)
- User and thread filtering: perf now supports a --uid command line option, which can be used to show only the tasks corresponding to a given user, for example "perf top --uid 1000". It can also collect events for multiple threads or processes using a comma-separated list in the "-p" and "-t" parameters, e.g. "perf top -p 21483,21485". Code: (commit), (commit)
1.8. 'Yama' security module
Linux has several security modules: selinux, apparmor, etc. Yama is a new security module that collects a number of system-wide DAC security protections that are not handled by the core kernel itself. For now, Yama restricts the ptrace interface, which allows a process to examine the memory and running state of any of the processes of the same user.
Code: (commit)
1.9. QNX6 filesystem
The qnx6fs filesystem is used by newer versions of the QNX operating system (e.g. Neutrino). It was introduced in QNX 6.4.0 and has been the default since 6.4.1. This release adds read-only support.
Code: (commit)
2. Driver and architecture-specific changes
All the driver and architecture-specific changes can be found in the Linux_3.4_DriverArch page
3. File systems
- ext4
- Btrfs
- FUSE
- NFS
- GFS2
XFS: Scalability improvements for quotas (commit), (commit), (commit), (commit)
CIFS: Introduce credit-based flow control (commit)
HFSplus: Making an HFS Plus partition bootable requires the ability to "bless" a file by putting its inode number in the volume header. Doing this from userspace on a mounted filesystem is impractical since the kernel will write back the original values on unmount. Add an ioctl to allow userspace to update the volume header information based on the target file (commit)
4. Various core changes
A new kernel parameter, "nomodule", will disable module loading (commit)
The pipe2() system call permits a new flag, O_DIRECT, that creates a pipe that operates in "packet" mode. Each write() to the pipe creates a distinct packet, and each read() reads exactly one packet (commit)
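A minimal sketch of the packet mode, assuming a Linux host and Python's os.pipe2() wrapper around the system call. The helper name packet_pipe_demo is made up for this example.

```python
import os


def packet_pipe_demo():
    """Write two packets to an O_DIRECT pipe and read them back one by one."""
    read_fd, write_fd = os.pipe2(os.O_DIRECT)  # pipe in "packet" mode
    try:
        os.write(write_fd, b"first")
        os.write(write_fd, b"second packet")
        # Each read returns a single packet, even though the buffer is
        # large enough to hold both writes.
        first = os.read(read_fd, 4096)
        second = os.read(read_fd, 4096)
        return first, second
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(packet_pipe_demo())
```

On a regular pipe the first read could return the bytes of both writes together; in packet mode the write boundaries are preserved.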
The new PR_SET_CHILD_SUBREAPER prctl() operation allows a userspace service manager/supervisor to mark itself as a sort of 'sub-init', able to stay as the parent of all orphaned processes created by the services it started. All SIGCHLD signals will be delivered to the service manager (commit)
Mark thread stack correctly in /proc/<pid>/maps (commit)
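A minimal sketch of how a supervisor could set and query the flag, assuming a Linux host and calling prctl() through ctypes. The numeric constants are taken from the kernel's prctl header; the helper names are made up for this example.

```python
import ctypes
import os

# Values from <linux/prctl.h>.
PR_SET_CHILD_SUBREAPER = 36
PR_GET_CHILD_SUBREAPER = 37

_libc = ctypes.CDLL(None, use_errno=True)


def _check(result):
    # prctl() returns -1 and sets errno on failure.
    if result != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def set_subreaper(enabled=True):
    """Mark (or unmark) the calling process as a child subreaper."""
    zero = ctypes.c_ulong(0)
    _check(_libc.prctl(PR_SET_CHILD_SUBREAPER,
                       ctypes.c_ulong(1 if enabled else 0), zero, zero, zero))


def is_subreaper():
    """Return True if the calling process is currently a child subreaper."""
    flag = ctypes.c_int(0)
    zero = ctypes.c_ulong(0)
    _check(_libc.prctl(PR_GET_CHILD_SUBREAPER,
                       ctypes.byref(flag), zero, zero, zero))
    return bool(flag.value)


if __name__ == "__main__":
    set_subreaper(True)
    print("subreaper:", is_subreaper())
```

Once the flag is set, orphaned descendants are reparented to this process instead of init, so the supervisor can collect them with the usual wait calls.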
tty: rework the pty count limits (commit)
kgdb: add the ability to control the reboot (commit)
vfs micro-optimization: use 'unsigned long' accesses for dcache name comparison (commit)
5. Memory management
Make swap-in readahead skip over holes: when the swapped-out data has holes, things are swapped back in at rates of several MB/second, instead of a few hundred kB/second (commit)
radix-tree micro-optimization: introduce bit-optimized iterator (commit)
6. Networking
New "plug" queuing discipline: allows userspace to plug/unplug a network output queue, using the Netlink interface. When it receives an enqueue command it inserts a plug into the outbound queue that causes following packets to enqueue until a dequeue command arrives over Netlink, causing the plug to be removed and resuming the normal packet flow (commit)
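The plug behaviour can be modeled with a toy queue. This sketch only mirrors the description above; the PlugQueue class and its method names are hypothetical and do not correspond to the Netlink commands of the real queuing discipline.

```python
from collections import deque


class PlugQueue:
    """Toy model of an output queue that can be plugged and unplugged."""

    def __init__(self):
        self._held = deque()   # packets waiting behind the plug
        self._plugged = False
        self.sent = []         # packets that left the queue, in order

    def plug(self):
        # From now on, packets are held instead of being transmitted.
        self._plugged = True

    def unplug(self):
        # Remove the plug and release everything that was held, in order.
        self._plugged = False
        while self._held:
            self.sent.append(self._held.popleft())

    def enqueue(self, packet):
        if self._plugged:
            self._held.append(packet)
        else:
            self.sent.append(packet)


if __name__ == "__main__":
    queue = PlugQueue()
    queue.enqueue("p1")
    queue.plug()
    queue.enqueue("p2")
    queue.enqueue("p3")
    print(queue.sent)   # only p1 has been sent so far
    queue.unplug()
    print(queue.sent)   # p2 and p3 follow once the plug is removed
```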
BATMAN: add infrastructure to change routing algorithm (commit)
- TCP
- Socket options
Implement IP_UNICAST_IF and IPV6_UNICAST_IF socket options. They are needed by the Wine project for Windows support (commit 1, 2)
Introduce the SO_PEEK_OFF socket option. It specifies the offset from which MSG_PEEK starts peeking at queued data. When set to a negative value, MSG_PEEK works as usual and always peeks from the head of the queue (commit)
Support peeking offset for datagram, seqpacket and stream sockets (commit 1, 2)
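A minimal sketch of peeking at an offset, assuming a Linux host and a pair of connected Unix stream sockets. The fallback value 42 for SO_PEEK_OFF is an assumption that holds on common architectures such as x86 and is only used if the socket module does not export the constant; the helper name is made up.

```python
import socket

# Assumption: 42 is the value of SO_PEEK_OFF on common architectures.
SO_PEEK_OFF = getattr(socket, "SO_PEEK_OFF", 42)


def peek_at_offset(payload, offset, length):
    """Peek `length` bytes starting `offset` bytes into the queued data."""
    reader, writer = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        writer.sendall(payload)
        # Start peeking at `offset` instead of at the head of the queue.
        reader.setsockopt(socket.SOL_SOCKET, SO_PEEK_OFF, offset)
        peeked = reader.recv(length, socket.MSG_PEEK)
        # A negative value restores the usual peek-from-head behaviour.
        reader.setsockopt(socket.SOL_SOCKET, SO_PEEK_OFF, -1)
        # Peeking consumed nothing, so the full payload is still queued.
        remaining = reader.recv(len(payload))
        return peeked, remaining
    finally:
        reader.close()
        writer.close()


if __name__ == "__main__":
    print(peek_at_offset(b"hello world", 6, 5))
```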
MSG_TRUNC support for dgram sockets. MSG_TRUNC asks recv() to return the real length of the packet, even when it was longer than the passed buffer (commit)
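A minimal sketch of the flag, assuming a Linux host and a pair of connected Unix datagram sockets (the choice of Unix sockets and the helper name are assumptions for this example). recv_into() is used because it returns the length reported by the kernel as a plain integer.

```python
import socket


def real_datagram_length(payload, buffer_size):
    """Receive into a short buffer and report the datagram's real length."""
    receiver, sender = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sender.send(payload)
        buf = bytearray(buffer_size)
        # With MSG_TRUNC the return value is the full datagram length,
        # even though only buffer_size bytes are copied into buf.
        reported = receiver.recv_into(buf, buffer_size, socket.MSG_TRUNC)
        return reported, bytes(buf)
    finally:
        receiver.close()
        sender.close()


if __name__ == "__main__":
    print(real_datagram_length(b"0123456789", 4))
```

Comparing the reported length with the buffer size tells the caller whether the datagram was cut short and how big a buffer would have been needed.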
Add missing getsockopt for SO_NOFCS (commit)
- Netfilter
Add timeout extension. This allows you to attach timeout policies to flows via the connection tracking target (commit), (commit)
ctnetlink: add NAT support for expectations class (commit)
ipset: The "nomatch" keyword and option is added to the hash:*net* types, by which one can add exception entries to sets (commit)
Merge ipt_LOG and ip6_LOG into xt_LOG (commit)
- Bluetooth
7. Virtualization
- KVM
x86: increase recommended max vcpus to 160 (commit)
Allow host IRQ sharing for assigned PCI 2.3 devices (commit)
Infrastructure for software and hardware-based TSC rate (commit)
PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn (commit), e500: MMU API (commit)
s390: "Userspace controlled virtual machines" add parameter for KVM_CREATE_VM (commit)
- Hyper-V
- Xen
virtio-pci: S3 support (commit)
rpmsg: add virtio-based remote processor messaging bus (commit)
8. Crypto
caam: add sha224 and sha384 variants to existing (commit)
camellia: add assembler implementation for x86_64 (commit)
Driver for Tegra AES hardware (commit)
crc32: add slice-by-8 algorithm to existing code (commit)
9. Security
- Apparmor
10. Block
Make cfq_target_latency tunable through sysfs. (commit)
- Device Mapper (DM):
11. Perf profiling
ftrace: Add enable/disable ftrace_ops control interface (commit)
perf bench: Allow passing an iteration count to "bench mem" (commit)
- perf report
script: Add option resolving vmlinux path (commit)
Adding sysfs group format attribute for pmu device (commit)
Add support to specify pmu style event (commit)
perf ui browser: Add 's' key to filter by symbol name (commit)
Rename "jump labels" to "static keys": Introduce 'struct static_key' (commit)