Linux 3.3 has been released (official announcement) on 18 Mar 2012.
Summary: The most important change in this release is the merge of kernel code from the Android project. It also includes support for a new architecture (TI C6X), much improved balancing and the ability to restripe between different RAID profiles in Btrfs, and several network improvements: a virtual switch implementation (Open vSwitch) designed for virtualization scenarios, a faster and more scalable alternative to the "bonding" driver, a configurable limit on the transmission queue of network devices to fight bufferbloat, a network priority control group and per-cgroup TCP buffer limits. There are also many small features, new drivers and fixes.
Contents
Prominent features in Linux 3.3
- Android merge
- Btrfs: restriping between different RAID levels, improved balancing, improved debugging tools
- Open vSwitch
- Better bonding of network interfaces: teaming
- Bufferbloat fighting: Byte queue limits
- Per-cgroup TCP buffer limits
- Network priority control group
- Better ext4 online resizing
- New architecture: TI C6X
- EFI boot support
- Driver and architecture-specific changes
- Various core changes
- Memory management
- File systems
- Networking
- Virtualization
- Crypto
- Security
- Tracing/profiling
1. Prominent features in Linux 3.3
1.1. Android merge
Recommended LWN article: Bringing Android closer to the mainline
The Android project uses the Linux kernel, but with some modifications and features built by its own developers. For a long time, that code was not merged back into the Linux repositories due to disagreements between developers from both projects. Fortunately, after several years the differences are being ironed out. Various Android subsystems and features have already been merged, and more will follow in the future. This will make things easier for everybody, including the Android mod community and Linux distributions that want to support Android programs.
Code: (commit), (commit), (commit), (commit)
1.2. Btrfs: restriping between different RAID levels, improved balancing, improved debugging tools
Improved balancing, RAID restriping
In Btrfs, a "balance" operation consists of a complete rewrite of the filesystem data, pushing all the rewritten data and metadata through the allocators. This operation is needed in some cases. For example, if a new drive is added, a balance operation is needed to redistribute data to the new drive. Until now, however, a balance operation always rebalanced the entire filesystem, which could take many hours, and it didn't support changing the RAID profile.
The balancing implementation has been completely reworked. Btrfs can now pause and resume a balance operation, and give status updates. It is also possible to restripe between different RAID levels. In addition, the balance can be filtered based on metadata/data profiles, and it can be restricted to mostly empty block groups. The userspace utilities are available in the "parser" branch of btrfs-progs.
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
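As a rough illustration of how the balance filters might be driven from a script, the following Python sketch only builds a command line and does not execute anything. The flag spellings (-dconvert=, -dusage=, -mconvert=) follow the syntax of later btrfs-progs releases and are an assumption here; the "parser" branch mentioned above may differ. The helper name balance_argv and the /mnt mount point are hypothetical.

```python
def balance_argv(mount_point, data_convert=None, metadata_convert=None,
                 data_usage=None):
    """Build an argv list for 'btrfs balance start' (nothing is executed).

    Assumption: filter syntax as in later btrfs-progs releases, where
    several filters for one block group type are joined with commas.
    """
    argv = ["btrfs", "balance", "start"]
    data_filters = []
    if data_convert is not None:
        # Restripe data block groups to another RAID profile.
        data_filters.append("convert=" + data_convert)
    if data_usage is not None:
        # Only touch data block groups that are at most this percent full.
        data_filters.append("usage=%d" % data_usage)
    if data_filters:
        argv.append("-d" + ",".join(data_filters))
    if metadata_convert is not None:
        # Restripe metadata block groups to another RAID profile.
        argv.append("-mconvert=" + metadata_convert)
    argv.append(mount_point)
    return argv


if __name__ == "__main__":
    # Convert data and metadata to RAID1 after adding a second drive.
    print(" ".join(balance_argv("/mnt", data_convert="raid1",
                                metadata_convert="raid1")))
    # Balance only mostly empty data block groups.
    print(" ".join(balance_argv("/mnt", data_usage=5)))
```

Keeping the command construction separate from its execution makes it easy to review what would run before handing the list to something like subprocess.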
Improved debugging
Btrfs has a new debugging utility, "integrity check", aimed at developers. The tool consists of an extra integrity test that checks, for every write request, that the filesystem is not writing bogus references to the disk that could leave the file system in an inconsistent state and cause data loss. This tool will help Btrfs developers find bugs more easily.
1.3. Open vSwitch
Recommended LWN article: Routing Open vSwitch into the mainline
Open vSwitch is a software implementation of a multilayer network switch. The project has existed for years and is now being merged into the main tree. Linux already has a virtual switch (the Linux bridge), but Open vSwitch is designed for more complex scenarios, especially for use as a vswitch in virtualized server environments (read the document "Why Open vSwitch?").
Open vSwitch supports standard management interfaces (e.g. sFlow, NetFlow, RSPAN, CLI), is open to programmatic extension and control using OpenFlow and the OVSDB management protocol, and is designed to be compatible with modern switching chipsets. See openvswitch.org for more information and userspace utilities.
Code: (commit)
1.4. Better bonding of network interfaces: teaming
There is a new "teaming" network device, which is intended to be a fast, scalable, clean, userspace-driven replacement for the bonding driver. It allows creating virtual interfaces that team together multiple Ethernet devices. This is typically used to increase the maximum bandwidth and provide redundancy. Currently, round-robin and active-backup modes are implemented. The libteam userspace library, with a couple of demo apps, is available at github.com/jpirko/libteam
Code: (commit)
1.5. Bufferbloat fighting: Byte queue limits
Recommended LWN article: Network transmit queue limits
"Bufferbloat" is a term used to describe the latency and throughput problems caused by excessive buffering in the various elements of a network connection. Some tools are being developed to help alleviate these problems, and this feature is one of them.
Byte queue limits are a configurable limit on the amount of packet data that can be put in the transmission queue of a network device. As a result, one can tune things so that high priority packets get serviced with reasonable latency, without letting the hardware queue run empty while data is available to send. The queue limits are configured in the byte_queue_limits directory under the tx-<n> sysfs directory of each queue.
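A minimal Python sketch of how these sysfs files might be read and tuned follows. It assumes the queue directories live under /sys/class/net/<device>/queues/ and that the attribute files are named limit, limit_max, limit_min, hold_time and inflight, as in mainline kernels; neither detail is stated above. The helper names and the eth0 device are hypothetical, and writing limit_max on a real system requires root.

```python
import os

# Attribute files expected inside byte_queue_limits (assumed names).
BQL_ATTRS = ("limit", "limit_max", "limit_min", "hold_time", "inflight")


def bql_dir(device, queue, sysfs_root="/sys/class/net"):
    """Return the byte_queue_limits directory of one transmit queue."""
    return os.path.join(sysfs_root, device, "queues",
                        "tx-%d" % queue, "byte_queue_limits")


def read_bql(device, queue=0, sysfs_root="/sys/class/net"):
    """Read every BQL attribute that exists and return it as a dict of ints."""
    base = bql_dir(device, queue, sysfs_root)
    values = {}
    for name in BQL_ATTRS:
        path = os.path.join(base, name)
        if os.path.exists(path):
            with open(path) as f:
                values[name] = int(f.read().strip())
    return values


def set_bql_limit_max(device, queue, nbytes, sysfs_root="/sys/class/net"):
    """Cap the number of bytes that may be queued on one transmit queue."""
    path = os.path.join(bql_dir(device, queue, sysfs_root), "limit_max")
    with open(path, "w") as f:
        f.write("%d\n" % nbytes)


if __name__ == "__main__":
    # Prints an empty dict if the device or the BQL directory is absent.
    print(read_bql("eth0", 0))
```

The sysfs_root parameter exists only so the helpers can be exercised against a scratch directory instead of the live system.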
1.6. Per-cgroup TCP buffer limits
Recommended LWN article: Per-cgroup TCP buffer limits
This patch introduces memory pressure controls for the TCP protocol, which make it possible to limit the size of the buffers used by the TCP code.
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8)
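As a sketch of how such a limit might be applied from userspace, the Python helpers below write to the memory cgroup control files. The file names memory.kmem.tcp.limit_in_bytes and memory.kmem.tcp.usage_in_bytes are taken from the memory cgroup interface as it appears in mainline kernels, not from the text above, so treat them as an assumption. The cgroup path in the example is hypothetical.

```python
import os


def set_tcp_buffer_limit(cgroup_dir, limit_bytes):
    """Limit the TCP buffer memory available to the tasks of one cgroup."""
    path = os.path.join(cgroup_dir, "memory.kmem.tcp.limit_in_bytes")
    with open(path, "w") as f:
        f.write(str(limit_bytes))


def tcp_buffer_usage(cgroup_dir):
    """Return the TCP buffer memory currently charged to the cgroup."""
    path = os.path.join(cgroup_dir, "memory.kmem.tcp.usage_in_bytes")
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    group = "/sys/fs/cgroup/memory/webservers"  # hypothetical cgroup
    if os.path.isdir(group):
        set_tcp_buffer_limit(group, 64 * 1024 * 1024)  # 64 MiB
        print(tcp_buffer_usage(group))
```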
1.7. Network priority control group
The network priority cgroup provides an interface that allows an administrator to dynamically set the priority of network traffic generated by various applications. Nominally, an application would set the priority of its traffic via the SO_PRIORITY socket option. This, however, is not always possible. This cgroup allows an administrator to assign a process to a group which defines the priority of egress traffic on a given interface. More details can be found in Documentation/cgroups/net_prio.txt.
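The following Python sketch shows how the per-interface priority map of such a group might be read and updated. It assumes the control file is called net_prio.ifpriomap and holds one "<interface> <priority>" pair per line, as described in the kernel's net_prio documentation; the helper names and the cgroup directory are hypothetical.

```python
import os


def parse_ifpriomap(text):
    """Parse the contents of net_prio.ifpriomap into {interface: priority}."""
    priorities = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            priorities[fields[0]] = int(fields[1])
    return priorities


def ifpriomap_entry(ifname, priority):
    """Format one "<interface> <priority>" entry for writing to the map."""
    if priority < 0:
        raise ValueError("priority must not be negative")
    return "%s %d" % (ifname, priority)


def set_interface_priority(cgroup_dir, ifname, priority):
    """Set the egress priority for traffic of this cgroup on one interface."""
    path = os.path.join(cgroup_dir, "net_prio.ifpriomap")
    with open(path, "w") as f:
        f.write(ifpriomap_entry(ifname, priority))


if __name__ == "__main__":
    group = "/sys/fs/cgroup/net_prio/batch"  # hypothetical cgroup
    if os.path.isdir(group):
        set_interface_priority(group, "eth0", 5)
        with open(os.path.join(group, "net_prio.ifpriomap")) as f:
            print(parse_ifpriomap(f.read()))
```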
1.8. Better ext4 online resizing
This release supports a new online resizing ioctl. The new implementation lets the kernel do all the work, such as allocating bitmaps and inode tables; it can support the flex_bg and BLOCK_UNINIT features, and it is much faster.
Code: (commit)
1.9. New architecture: TI C6X
Recommended LWN article: Upcoming DSP architectures
The family of architectures that Linux runs on has grown even bigger with the addition of support for the Texas Instruments C6X. This architecture supports members of the Texas Instruments family of C64x single and multicore DSPs. The multicore DSPs do not support cache coherency, so they are not suitable for SMP. Also, these are no-mmu processors. The core architecture is VLIW, with an instruction set optimized for DSP applications. For details on the processors, see the TI web page. See also the project website: linux-c6x.org
Code: (directory)
1.10. EFI boot support
This release introduces an EFI boot stub that allows an x86 bzImage to be loaded and executed directly by EFI firmware. The bzImage appears to the firmware as an EFI application. Both BIOS and EFI boot loaders can still load and run the same bzImage, thereby allowing a single kernel image to work in any boot environment.
Code: (commit)
2. Driver and architecture-specific changes
All the driver and architecture-specific changes can be found in the Linux_3.3_DriverArch page
3. Various core changes
Use jump labels to reduce overhead when the CFS bandwidth control group is disabled (commit)
modules: sysfs - export: taint, coresize, initsize (commit)
Add BLKROTATIONAL ioctl, which permits applications to query whether a block device is rotational (commit)
selftests: new very basic kernel selftests directory (commit)
- proc filesystem
Add hidepid= and gid= mount options. hidepid=0 means classic mode: everybody may access all /proc/<pid>/ directories (default). hidepid=1 means users may not access any /proc/<pid>/ directories but their own. hidepid=2 means hidepid=1 plus all /proc/<pid>/ directories will be fully invisible to other users. gid= defines a group authorized to learn process information otherwise prohibited by hidepid= (commit)
Introduce the /proc/<pid>/map_files/ directory. It behaves similarly to /proc/<pid>/fd/: it contains one symlink for each file-backed mapping; the name of a symlink is "vma->vm_start-vma->vm_end" and the target is the file. Opening a symlink results in a file that points to exactly the same inode as the vma's (commit)
Parse mount options (commit)
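To make the new proc features more concrete, here is a small Python sketch that inspects them from userspace. It assumes that hidepid= shows up as a number in the options column of /proc/mounts and that map_files symlink names are two hexadecimal addresses without a 0x prefix; both are assumptions that go beyond the descriptions above, and the helper names are hypothetical.

```python
def proc_mount_options(mounts_text):
    """Return the mount options of the first 'proc' entry as a dict."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, ...
        if len(fields) >= 4 and fields[2] == "proc":
            options = {}
            for opt in fields[3].split(","):
                key, _, value = opt.partition("=")
                options[key] = value
            return options
    return {}


def hidepid_level(mounts_text):
    """Return the hidepid level in effect (0 when the option is absent)."""
    return int(proc_mount_options(mounts_text).get("hidepid", "0"))


def parse_map_files_name(name):
    """Split a map_files symlink name into (start, end) addresses."""
    start, end = name.split("-")
    return int(start, 16), int(end, 16)


if __name__ == "__main__":
    sample = "proc /proc proc rw,relatime,hidepid=2,gid=1000 0 0\n"
    print(hidepid_level(sample))
    start, end = parse_map_files_name("400000-40b000")
    print(end - start)  # size of the mapping in bytes
```

On a live system the text would come from reading /proc/mounts and the names from listing /proc/<pid>/map_files/, which may require extra privileges.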
Add a per-pci-device subdirectory in sysfs called: /sys/bus/pci/devices/<device>/msi_irqs. This sub-directory exports the set of MSI vectors allocated by a given PCI device, by creating a numbered sub-directory for each vector beneath msi_irqs. Currently the only attribute is called mode, which tracks the operational mode of that vector (msi vs. msix) (commit)
Add an "archheaders" build target (commit)
Implement 'sysdev' classes and devices, for "system" devices and buses. This will make it possible to use udev with them (commit)
Add a few /proc entries and prctl() codes for future checkpoint/restart support (commit 1, 2, 3)
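The msi_irqs sysfs directory described in this section can be walked with a few lines of Python. This sketch follows the layout given above (one numbered sub-directory per vector, each holding a mode attribute); later kernels may lay this out differently. The helper name and the PCI address in the example are hypothetical.

```python
import os


def msi_vector_modes(device_dir):
    """Return {vector number: mode} for one PCI device directory.

    device_dir is a path such as /sys/bus/pci/devices/<device>.
    An empty dict is returned when the device has no msi_irqs directory.
    """
    base = os.path.join(device_dir, "msi_irqs")
    modes = {}
    if not os.path.isdir(base):
        return modes
    for entry in os.listdir(base):
        mode_path = os.path.join(base, entry, "mode")
        # Each vector is a numbered sub-directory with a "mode" file.
        if entry.isdigit() and os.path.isfile(mode_path):
            with open(mode_path) as f:
                modes[int(entry)] = f.read().strip()
    return modes


if __name__ == "__main__":
    # Hypothetical device address; prints {} if it does not exist.
    print(msi_vector_modes("/sys/bus/pci/devices/0000:00:1f.2"))
```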
4. Memory management
Memory control group naturalisation, dramatically reducing its memory overhead. Recommended LWN article (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
Reduce the significant stalls that compaction combined with Transparent Huge Pages can cause when using USB sticks or a browser. Recommended LWN article (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
There is a limit to the maximum number of dirty pages that exist in the system at any time. However, the per-zone page allocator can fill one zone while other zones are spared. Implement per-zone dirty limits to distribute pages fairly across zones (commit)
Introduce slab_max_order kernel parameter. It determines the maximum allowed order for slabs. A high setting may cause OOMs due to memory fragmentation (commit)
More intensive memory corruption debugging (commit)
- Writeback
A large number of short-lived dirtiers (e.g. gcc instances in a fast kernel build) may starve long-running dirtiers (e.g. dd) as well as push the dirty pages to the global hard limit. The solution is to charge the pages dirtied by the exited gcc to other random dirtying tasks. This is not perfect, but it should behave well enough in practice (commit)
Control the pause time and the call intervals to balance_dirty_pages() (see commit for more details) (commit)
Avoid dirty tasks getting too much throttling when doing sequential writes smaller than a page (commit), (commit)
Compensate the task's think time when computing the final pause time (commit)
Help to reduce dirty throttling polls and hence CPU overheads. (commit)
The LKP tests see a big 56% regression in the fio_mmap_randwrite_64k case. Avoid tiny dirty poll intervals to restore most of the performance (commit)
5. File systems
- Btrfs
- GFS2
- FUSE
NFSD: Added fault injection (commit)
6. Networking
Support the socket monitoring interface used by the ss tool in UNIX sockets (commit 1, 2, 3, 4, 5, 6, 7, 8, 9)
Support for the SCSI RDMA Protocol (SRP) Target driver. SRP allows an initiator to access a block storage device on another host (target) over a network that supports RDMA. Currently, RDMA is supported by InfiniBand and by iWARP network hardware. More information about SRP can be found on the website of the INCITS T10 technical committee (commit)
Implementation for the NFC Logical Link Controller protocol. It's also known as NFC peer to peer mode (commit)
6LoWPAN: add fragmentation support (commit), UDP header compression (commit), UDP header decompression (commit)
neigh: new unresolved queue limits: deprecate neigh/default/unres_qlen, replace it with unres_qlen_bytes (commit)
CAIF USB support (commit)
- Netfilter
Add extended accounting infrastructure over nfnetlink, which aims to allow displaying real-time traffic accounting without the need of complicated and resource-consuming implementation in user-space (commit)
Add nfacct match to support extended accounting (commit)
Add "rpfilter" reverse path filter match support, which allows matching packets whose replies would go out via the interface the packet came in on (commit), (commit)
- Packet scheduler
Adaptive RED AQM for Linux, based on the paper by Sally Floyd, Ramakrishna Gummadi, and Scott Shenker (commit)
vlan: add 802.1q netpoll support (commit)
bridge: add NTF_USE support (commit)
Add wireless TX status socket option (commit)
7. Virtualization
KVM: Expose a version 2 architectural PMU (Performance Monitoring Unit) to a guest (commit)
- Xen
8. Crypto
caam - add support for MD5 algorithm variants (commit)
Digital signature verification support (commit)
Multiprecision maths library from GnuPG: used to implement RSA digital signature verification, which is used by IMA/EVM digital signature extension (commit 1, 2, 3, 4)
serpent - add 4-way parallel i586/SSE2 assembler (commit), add 8-way parallel x86_64/SSE2 assembler (commit)
serpent-sse2 - add lrw support (commit), add xts support (commit)
talitos - add hmac algorithms (commit)
twofish-x86_64-3way - add xts support (commit)
9. Security
- audit
evm: digital signature verification support (commit)
10. Tracing/profiling
- perf