There are various uses for this sort of introspection. One is to answer the question: what capabilities does process X have in namespace Y? The rules that determine the answer to that question have been documented in the user_namespaces(7) manual page for quite a while, but until now, there was no way of empirically answering that question with respect to a particular process and a particular namespace on a running system. This changes in Linux 4.9, thanks to work that Andrei Vagin did after I asked about this possibility on the Linux kernel mailing list back in July.
The solution, suggested by Eric Biederman, is rather elegant (even if implemented as ioctl() operations), and is based on returning file descriptors referring to objects in the (unmounted) namespace filesystem (NSFS). Given a file descriptor, fd, that refers to one of the /proc/PID/ns/xxxx symbolic links, two operations can be performed:
- ioctl(fd, NS_GET_USERNS): Returns a file descriptor that refers to the owning user namespace for the namespace referred to by fd.
- ioctl(fd, NS_GET_PARENT): Returns a file descriptor that refers to the parent namespace for the namespace referred to by fd. This operation can be applied only to hierarchical namespaces (PID namespaces and user namespaces). This operation may fail (with the error EPERM) if the parent namespace is outside the namespace scope of the caller. This might be the case if, for example, the parent of a PID namespace is an ancestor namespace of the caller's PID namespace. In addition, this error can occur when trying to find the parent of the initial PID or user namespace. When working our way backward through the chain of ancestors of a namespace, this fact can be used to determine whether we have reached the initial namespace.
Another possible use of this feature is to introspect across all processes on the system to discover the PID and user namespace hierarchies on a live system. (And also to discover the relationship of non-user namespaces to their owning user namespaces.)
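Before turning to the full program, the parenthetical point can be sketched on its own. The following fragment (a sketch, not a polished tool) lists the non-user namespaces of the calling process and, for each, reports the device ID and inode number of the owning user namespace obtained via NS_GET_USERNS. The names nsID and ownerOf are hypothetical, and the code assumes Linux 4.9 or later and Go 1.16 or later (for os.ReadDir).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

const nsGetUserNS = 0xb701 // NS_GET_USERNS: _IO(0xb7, 0x1) in <linux/nsfs.h>

// nsID identifies a namespace by device ID and inode number.
type nsID struct {
	dev, ino uint64
}

// ownerOf opens the namespace file at path and returns the ID of the
// user namespace that owns it.
func ownerOf(path string) (nsID, error) {
	fd, err := syscall.Open(path, syscall.O_RDONLY, 0)
	if err != nil {
		return nsID{}, err
	}
	defer syscall.Close(fd)

	r, _, errno := syscall.Syscall(syscall.SYS_IOCTL, uintptr(fd), nsGetUserNS, 0)
	if errno != 0 {
		return nsID{}, errno
	}
	ownerFD := int(r)
	defer syscall.Close(ownerFD)

	var sb syscall.Stat_t
	if err := syscall.Fstat(ownerFD, &sb); err != nil {
		return nsID{}, err
	}
	return nsID{uint64(sb.Dev), uint64(sb.Ino)}, nil
}

func main() {
	const dir = "/proc/self/ns"
	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Println("ReadDir:", err)
		return
	}
	for _, e := range entries {
		if e.Name() == "user" {
			continue // only non-user namespaces are of interest here
		}
		owner, err := ownerOf(filepath.Join(dir, e.Name()))
		if err != nil {
			fmt.Printf("%-20s %v\n", e.Name(), err)
			continue
		}
		fmt.Printf("%-20s owned by user namespace {%d %d}\n",
			e.Name(), owner.dev, owner.ino)
	}
}
```

For a process that has never left the initial namespaces, every line should report the same owner: the initial user namespace.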
The following Go program provides an example of such introspection. It inspects the /proc/PID/ns/user files for all processes on the system and builds up a map of the user namespace hierarchy along with the processes that reside in each namespace.
The program is fairly well commented, so without further explanation, I'll just present the code. (I should add that this is my first attempt at using Go, which is a nice language, so the code may not be idiomatic, and may also have some errors, but it should serve to illustrate what's going on.) An example run is shown below. The program code can be found in the code tarball available for download on my website.
/* userns_overview.go

   Display a hierarchical view of the user namespaces on the
   system along with the member processes for each namespace.
   This requires features new in Linux 4.9. See the
   namespaces(7) man page.
   (http://man7.org/linux/man-pages/man7/namespaces.7.html)
*/
package main

import (
    "fmt"
    "io/ioutil"
    "os"
    "sort"
    "strconv"
    "strings"
    "syscall"
)

// A namespace is identified by device ID and inode number
type NamespaceID struct {
    device   uint64 // dev_t
    inodeNum uint64 // ino_t
}

// A namespace has associated attributes: a set of
// child namespaces and a set of member processes
type NamespaceAttribs struct {
    children []NamespaceID // Child namespaces
    pids     []int         // Member processes
}

// The following map records all of the namespaces that
// we find on the system
var NSList = make(map[NamespaceID]*NamespaceAttribs)

// Along the way, we'll discover the ancestor of all user
// namespaces (the root of the user namespace hierarchy).
var initialNS NamespaceID

// AddNamespace adds a PID to the list of PIDs associated with
// the user namespace referred to by 'namespaceFD'.
//
// The set of namespaces is recorded in the 'NSList' map.
// If the map does not yet contain an entry corresponding to
// 'namespaceFD', then an entry is created. This process is
// recursive: if the parent of the user namespace referred
// to by 'namespaceFD' does not have an entry in 'NSList'
// then an entry is created for the parent, and the namespace
// referred to by 'namespaceFD' is made a child of that namespace.
//
// When called recursively to create the ancestor namespace
// entries, this function is called with 'pid' as -1, meaning
// that no PID needs to be added for this namespace entry.
//
// The return value of the function is the ID of the namespace
// entry (i.e., the device ID and inode number corresponding to
// the user namespace file referred to by 'namespaceFD').
func AddNamespace(namespaceFD int, pid int) NamespaceID {
    const NS_GET_PARENT = 0xb702 // ioctl() to get namespace parent

    var sb syscall.Stat_t

    // Obtain the device ID and inode number of the namespace
    // file. These values together form the key for the 'NSList'
    // map entry.
    if err := syscall.Fstat(namespaceFD, &sb); err != nil {
        fmt.Println("syscall.Fstat():", err)
        os.Exit(1)
    }

    ns := NamespaceID{uint64(sb.Dev), uint64(sb.Ino)}

    if _, fnd := NSList[ns]; !fnd {
        // Namespace entry does not yet exist; create it
        NSList[ns] = new(NamespaceAttribs)

        // Get file descriptor for parent user namespace
        r, _, e := syscall.Syscall(syscall.SYS_IOCTL,
            uintptr(namespaceFD), uintptr(NS_GET_PARENT), 0)

        if e != 0 {
            switch e {
            case syscall.EPERM:
                // No accessible parent: this is the initial NS
                // (or the root of the hierarchy visible to us);
                // remember it
                initialNS = ns
            case syscall.ENOTTY:
                fmt.Println("This kernel doesn't support " +
                    "namespace introspection")
                os.Exit(1)
            default:
                // Unexpected error; bail
                fmt.Println("ioctl()", e)
                os.Exit(1)
            }
        } else {
            parentFD := int(r)

            // We have a parent user namespace; make sure it
            // has an entry in the map. No need to add any
            // PID for the parent entry.
            par := AddNamespace(parentFD, -1)

            // Make the current namespace entry ('ns') a child of
            // the parent namespace entry
            NSList[par].children = append(NSList[par].children, ns)

            syscall.Close(parentFD)
        }
    }

    // Add PID to PID list for this namespace entry
    if pid > 0 {
        NSList[ns].pids = append(NSList[ns].pids, pid)
    }

    return ns
}

// ProcessProcFile processes a single /proc/PID entry, creating
// a namespace entry for this PID's /proc/PID/ns/user file
// (and, as necessary, namespace entries for all ancestor namespaces
// going back to the initial user namespace).
// 'name' is the name of a PID directory under /proc.
func ProcessProcFile(name string) {
    // Obtain a file descriptor that refers to the user namespace
    // of this process
    namespaceFD, err := syscall.Open("/proc/"+name+"/ns/user",
        syscall.O_RDONLY, 0)
    if err != nil {
        if err == syscall.ENOENT {
            // The process terminated after we listed /proc;
            // skip it
            return
        }
        fmt.Println("Open:", err)
        os.Exit(1)
    }
    defer syscall.Close(namespaceFD)

    pid, _ := strconv.Atoi(name)
    AddNamespace(namespaceFD, pid)
}

// DisplayNamespaceTree() recursively displays the namespace
// tree rooted at 'ns'. 'level' is our current level in the
// tree, and is used for producing suitably indented output.
func DisplayNamespaceTree(ns NamespaceID, level int) {
    prefix := strings.Repeat(" ", level*4)

    // Display the namespace ID (device ID + inode number)
    fmt.Print(prefix)
    fmt.Println(ns)

    // Print a sorted list of the PIDs that are members of this
    // namespace. We do a bit of a dance here to produce a list
    // of PIDs that is suitably wrapped, rather than a long
    // single-line list.
    sort.Ints(NSList[ns].pids)
    base := len(prefix) + 25
    col := base
    for i, p := range NSList[ns].pids {
        if i == 0 || col >= 80 && col > base+32 {
            col = base
            if i > 0 {
                fmt.Println()
            }
            fmt.Print(prefix)
            fmt.Print("        ")
            if i == 0 {
                fmt.Print("PIDs: ")
            } else {
                // Align continuation lines under the first PID
                fmt.Print("      ")
            }
        }
        fmt.Print(strconv.Itoa(p) + " ")
        col += len(strconv.Itoa(p)) + 1
    }
    fmt.Println()

    // Recursively display the child namespaces
    for _, v := range NSList[ns].children {
        DisplayNamespaceTree(v, level+1)
    }
}

func main() {
    // Fetch a list of files from /proc
    files, err := ioutil.ReadDir("/proc")
    if err != nil {
        fmt.Println("ioutil.ReadDir():", err)
        os.Exit(1)
    }

    // Process each /proc/PID (PID starts with a digit)
    for _, f := range files {
        if f.Name()[0] >= '0' && f.Name()[0] <= '9' {
            ProcessProcFile(f.Name())
        }
    }

    // Display the namespace tree rooted at the initial
    // user namespace
    DisplayNamespaceTree(initialNS, 0)
}

The following (abbreviated) output shows what happens when we run the program on a system where there are a few user namespaces. (We must run the program with privilege so that we can access the /proc/PID/ns/user files of all users' processes.)

$ sudo go run userns_overview.go
{3 4026531837}
        PIDs: 1 2 3 5 7 8 9 10 11 12 13 14 15 16 17 18 20 21 22 23 24
              25 26 28 29 30 31 32 33 34 36 37 38 39 40 41 42 43 44 45
              ...
              27101 27225 27245 27971 28142 28619 28870 28922 28995 29043
              29109 29209 29279 29455 29466 29481 29489 29532 29533 29550
    {3 4026532459}
        {3 4026532663}
                PIDs: 29745 29749 29823 29847
        {3 4026532450}
            {3 4026532662}
                    PIDs: 29746

The output of the program is somewhat primitive, but employs indentation to show the hierarchical relationships between the user namespaces. In all, there are five user namespaces shown above.
The first few lines show the initial user namespace and its member processes. The other user namespaces were created by an instance of the Google Chrome browser. The namespace with the inode number 4026532459 is a child of the initial user namespace. That namespace in turn has two descendants (4026532663 and 4026532450), and the last of those namespaces in turn has a descendant (4026532662).
The output also shows the PIDs of the processes that reside in each namespace. Two of the namespaces (inode numbers 4026532459 and 4026532450) have no member processes (but are pinned into existence by the presence of descendant user namespaces).
Some more details about the namespace introspection feature, as well as a simpler example program (in C), can be found in the namespaces(7) manual page.