-What's new in the Linux kernel - DebConf 2013
+What's new in the Linux kernel - DebConf 2014
@@ -39,7 +39,7 @@
@@ -50,6 +50,7 @@
What's new in the Linux kernel
+
and what's missing in Debian
Ben Hutchings
@@ -58,23 +59,21 @@
Professional software engineer by day, Debian developer by night
+ (or sometimes the other way round)
Regular Linux contributor in both roles since 2008
-
-
- Maintaining a net driver in my day job, plus core networking
- and PCI code as necessary
-
-
- Debian kernel team member, now doing most of the unstable
- maintenance aside from ports
-
-
- Maintaining Linux 3.2.y stable update series on
- kernel.org
-
-
+
+
+ Working on various drivers and kernel code in my day job
+
+
+ Debian kernel team member, now doing most of the unstable
+ maintenance aside from ports
+
+
+ Maintaining Linux 3.2.y stable update series on
+ kernel.org
@@ -85,10 +84,15 @@
Linux is released about 5 times a year (plus stable updates
every week or two)
+
+
+ ...though some features aren't ready to use when they first
+ appear in a release
+
+
- For 'wheezy' we chose to freeze with Linux 3.2, which was
- getting pretty old by the time of release
+ Since my talk last year, Linus has made 6 releases (3.11-3.16)
Good news: we have lots of new kernel features in testing/unstable
@@ -100,85 +104,306 @@
-
Team device driver [3.3]
+
Recap of last year's features (1)
+
+
+ Team device driver: userland package (libteam) was uploaded in
+ October
+
+
+ Transcendent memory: frontswap, zswap and Xen tmem will be
+ enabled in next kernel upload
+
+
+ New KMS drivers: should all work with current Xorg drivers
+
+
+ Module signing: still not enabled, but probably will be if we
+ do Secure Boot
+
+
+
+
+
+
Recap of last year's features (2)
+
+
+ More support for discard: still not enabled at install time
+ (#690977)
+
+
+ More support for containers: XFS was fixed, and user namespaces
+ have been enabled
+
+
+ bcache: userland package (bcache-tools) still not quite ready
+ (#708132)
+
+
+ ARMv7 multiplatform: d-i works on some platforms but
+ I'm still not sure which. Some progress on GPU drivers, but not
+ in Debian yet.
+
+
+
+
+
+
Unnamed temporary files [3.11]
+
+
+ Open a directory with the O_TMPFILE flag to create an
+ unnamed temporary file on that filesystem
+
+
+ As with tmpfile(), the file disappears on
+ last close()
+
+
+ File can be linked into the filesystem using
+ linkat(..., AT_EMPTY_PATH), allowing for 'atomic'
+ creation of file with complete contents and metadata
+
+
+ Not supported on all filesystem types, so you will usually need
+ a fallback
+
+
+
+
+
+
Network busy-polling [3.11] (1)
+
A conventional network request/response process looks like:
+
+
+
+ Task calls send(); network stack constructs a
+ packet; driver adds it to hardware Tx queue
+
+
+ Task calls poll() or recv(), which blocks;
+ kernel puts it to sleep and possibly idles the CPU
+
+
+ Network adapter receives response and generates IRQ, waking
+ up CPU
+
+
+ Driver's IRQ handler schedules polling of the hardware Rx
+ queue (NAPI)
+
+
+ Kernel runs the driver's NAPI poll function, which passes
+ the response packet into the network stack
+
+
+ Network stack decodes packet headers and adds packet to
+ the task's socket
+
+
+ Network stack wakes up sleeping task; scheduler switches
+ to it and the socket call returns
+
+
+
+
+
+
+
Network busy-polling [3.11] (2)
+
+
+ If the driver supports busy-polling, it tags each packet with
+ the receiving NAPI context, and the kernel copies that tag to
+ the destination socket
+
+
+ When busy-polling is enabled, poll()
+ and recv() call the driver's busy poll function to
+ check for packets synchronously (up to some time limit)
+
+
+ If the response usually arrives quickly, this reduces overall
+ request/response latency as there are no context switches and
+ power transitions
+
+
+ Time limit set by sysctl (net.core.busy_poll,
+ net.core.busy_read) or socket option (SOL_SOCKET,
+ SO_BUSY_POLL); requires tuning
+
+
+
+
+
+
Lustre filesystem [3.12]
+
+
+ A distributed filesystem, popular for cluster computing
+ applications
+
+
+ Developed out-of-tree since 1999, but now added to Linux staging
+ directory
+
+
+ Was included in squeeze but dropped from wheezy as it didn't
+ support Linux 3.2
+
+
+ Userland is now missing from Debian
+
+
+
+
+
+
Btrfs offline dedupe [3.12]
- Alternative to the bonding driver - simpler, modular, high-level
- control deferred to userland
+ Btrfs generally copies and frees blocks, rather than updating
+ in-place
+
+
+ This allows snapshots and file copies to copy-by-reference,
+ deferring the real copying until changes are made
+
+
+ Filesystems may still end up with multiple copies of the same
+ file content
- Basic configuration can be done with ip, but it really
- needs new tools - teamd, teamnl, etc.
+ Btrfs doesn't actively merge these duplicates, but userland can
+ tell it to do so
- Want to make it work? See
- http://bugs.debian.org/695850
+ Many file dedupe tools are packaged for Debian, but not one that
+ works with this Btrfs feature, e.g. bedup
-
Transcendent memory [3.0-3.5]
+
nftables [3.13]
- Abstract storage for memory pages, expected to be slower than
- regular memory but faster than disk
+ Linux has several firewall APIs - iptables, ip6tables, arptables
+ and ebtables
+
+
+ Each is limited to a single protocol, and needs a kernel
+ module for each match type and each action
- Can provide a second layer of page cache (cleancache and frontswap)
+ Kernel's internal netfilter API is more flexible
- Pages stored by hypervisor (Xen), compressed local memory
- (zcache) or cluster of machines (RAMster)
+ nftables exposes more of this flexibility, allowing userland
+ to provide firewall code for a specialised VM (similar to BPF)
- Not yet enabled in Debian kernels, and needs some thought about
- configuration
+ nftables userland tool uses this API and is already packaged
- Want to make it work? See
- https://lwn.net/Articles/454795/
- and mail debian-kernel
+ Eventually the old APIs will be removed, and the old userland
+ tools will have to be ported to nftables
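A minimal ruleset sketch, assuming the nft configuration syntax of the 2014-era tools; the table and chain names are illustrative. Note the family-neutral "inet" table (Linux 3.14) covering IPv4 and IPv6 at once, which the old per-protocol tools could not do:

```
# Example /etc/nftables.conf fragment (names are illustrative)
table inet filter {
    chain input {
        type filter hook input priority 0;
        ct state established,related accept
        tcp dport 22 accept
        counter drop
    }
}
```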
-
New KMS drivers [3.3-3.10]
+
User-space lockdep [3.14]
+
+
+ Kernel threads and interrupts all run in the same address space,
+ using several different synchronisation mechanisms
+
+
+ Easy to introduce bugs that can result in deadlock, but hard to
+ reproduce them
+
+
+ Kernel's 'lockdep' system dynamically tracks locking operations
+ and detects potential deadlocks
+
+
+ Now available as a userland library! Except we need to package
+ it (build from linux-tools source package)
+
+
+
+
+
+
arm64 and ppc64el ports
- DRM/KMS drivers added for old, new and virtual hardware -
- AST, DisplayLink, Hyper-V, Matrox G200, QEMU Cirrus
+ 'arm64' architecture was added in Linux 3.7, but was not yet
+ usable, and no real hardware was available at the time
+
+
+ Upstream Linux arm64 kernel, and Debian packages, should now run
+ on emulators and real hardware
- Should be more robust than purely user-mode drivers, and
- compatible with Secure Boot
+ 'powerpc' architecture has been available for many years,
+ but didn't support kernel running little-endian
- Current X drivers don't work with these, so the kernel drivers
- are disabled for now
+ Linux 3.13 added little-endian kernel support, along with new
+ userland ELF ABI variant - we call it ppc64el
- Want to make it work? Join the X Strike Force and package the
- new X drivers
+ Both ports now being bootstrapped in unstable and are candidates
+ for jessie release
-
Module signing [3.7]
+
File-private locking [3.15]
- Kernel modules can be signed at build time, and the kernel
- configured to refuse loading unsigned modules
+ POSIX says that closing a file descriptor removes
+ the process's locks on that file
+
+
+ What if process has multiple file descriptors for the same
+ file? It loses all locks obtained through any descriptor!
+
+
+ Multithreaded processes may require serialisation around
+ file open/close to ensure they open each file exactly once
+
+
+ Hard and symbolic links can hide that two files are really the
+ same
+
+
+ Linux now provides file-private locks, associated with a
+ specific open file and removed when last descriptor for the
+ open file is closed
+
+
+
+
+
+
Multiqueue block devices [3.16]
+
+
+ Each block device has a command queue (possibly shared with
+ other devices)
+
+
+ Queue may be partly implemented by hardware (NCQ) or only
+ in software
+
+
+ A single queue means initiation is serialised and completion
+ involves an IPI - can be a bottleneck for fast devices
- Necessary but not sufficient to implement Secure Boot -
- we would also need signed kernel images and some other
- restrictions when booted in this mode
+ High-end SSDs support multiple queues, but the kernel needed
+ changes to use them
- Want to make Secure Boot work? Come to the meeting on Tuesday
+ nvme and mtip32xx drivers now support
+ multiqueue, but SCSI drivers don't yet - may be backportable?