ID mapping for mounted filesystems

By Jonathan Corbet
November 19, 2020

Almost every filesystem (excepting relics like VFAT) implements the concept of the owner and group of each file; the higher levels of the operating system then use that information to control access to those files. For decades, it has usually sufficed to track a single owner and group for each file, but there is an increasing number of use cases wanting to make that ownership relative to the environment any given process is running in. Developers have been working for a few years to find solutions to this problem; the latest attempt is the ID-mapped mounts patch set from Christian Brauner.

In truth, the ID-mapping problem is not exactly new. User and group IDs for files only make sense across a management domain if there is a single authority controlling the assignment of those IDs. Since that is often not the case, network filesystems like NFS have had the ability to remap IDs for many years. The growth of virtualization and container technologies has brought the problem closer to home; there can be multiple management domains running on a single machine. The NFS ID-remapping mechanism is of little use if NFS itself is not being used.

For example, container runtime systems may want to provide a common root image to each container. User namespaces may be used to ensure that each container is running with a set of nonprivileged IDs on the host system, but those containers should be able to access their root images with root privileges. Mounting that image with ID remapping would make this possible. Similarly, ID remapping would make it easier to share filesystems between containers regardless of the IDs in use within each container. Or consider systemd-homed, which provides consistent access to a user's home directory across machines. If a user logs into a system and is given a user ID that doesn't match the ownership of their home directory, systemd-homed will change the ownership of all files in and below the home directory — not an especially efficient operation. ID remapping would solve the problem in a more satisfying way.

There have been a number of previous attempts to address these use cases. The shiftfs filesystem was designed to be stacked on top of an ordinary filesystem; it would then remap user and group IDs in operations as they passed through. That idea then evolved into shifting bind mounts, which moved the ID-mapping function into the virtual filesystem (VFS) layer. Shortly after that, Brauner proposed FSID mappings, which repurposed the kernel's filesystem-ID abstraction to perform the remapping. Now, with ID-mapped mounts, the remapping is again handled within the VFS, but with a twist.

This patch set adds a new pointer to the vfsmount structure that represents a mounted filesystem; this pointer, called mnt_user_ns, points to a user namespace. One of the key features of user namespaces is, of course, ID remapping; a process that is running within a user namespace will already have its user and group IDs remapped for any operation, including filesystem operations, that reaches outside of the namespace. But user namespaces have a single map that applies to all operations, and to all mounted filesystems; attaching a user namespace to the vfsmount structure allows every mounted filesystem to have a different mapping.
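To make the idea of an ID-mapping table concrete, here is a minimal user-space sketch of how a table of extents translates IDs in both directions. The extent layout follows the three-column format of the uid_map and gid_map files (first ID inside the namespace, first ID outside, count); the names id_extent, id_to_outside(), id_to_inside(), and ID_UNMAPPED are invented for this illustration and are not the kernel's own helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* One extent of an ID map, in the column order used by uid_map and gid_map:
 * "<first ID inside> <first ID outside> <count>". */
struct id_extent {
    uint32_t inside;   /* first ID as seen inside the namespace */
    uint32_t outside;  /* first ID as seen in the parent namespace */
    uint32_t count;    /* number of consecutive IDs covered */
};

#define ID_UNMAPPED UINT32_MAX  /* illustrative marker for "no mapping" */

/* Translate an ID seen inside the namespace to the corresponding outside ID. */
uint32_t id_to_outside(const struct id_extent *map, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        if (id >= map[i].inside && id - map[i].inside < map[i].count)
            return map[i].outside + (id - map[i].inside);
    }
    return ID_UNMAPPED;
}

/* Translate an outside ID back to the ID seen inside the namespace. */
uint32_t id_to_inside(const struct id_extent *map, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        if (id >= map[i].outside && id - map[i].outside < map[i].count)
            return map[i].inside + (id - map[i].outside);
    }
    return ID_UNMAPPED;
}
```

With a single extent of {0, 100000, 65536}, ID 0 inside corresponds to 100000 outside, and anything beyond the 65536-ID range is simply unmapped. Attaching such a table to each mount, rather than to each process, is what lets two mounts of the same filesystem present different ownership.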

Setting up ID-mapped mounts, thus, involves the creation of user namespaces to contain the ID-mapping tables. These user namespaces will, most likely, never have processes running within them; in a sense, much of their functionality is wasted in this context. But this approach made it possible to use all of the existing ID-mapping helpers, while creating a more focused ID-mapping abstraction would require duplicating much of that functionality.

By default, mounted filesystems will point to the initial user namespace, which is taken as an indication that no remapping is to be done at that layer. Code that wants to add ID mapping to a mounted filesystem has to start by creating a new user namespace; this is a bit of a roundabout procedure that is not directly supported by the kernel. In a sample mount-idmapped tool written by Brauner, this task is done by creating a new process within its own user namespace. The child process does nothing but suspend itself with a SIGSTOP signal while the parent creates a reference to the child's user namespace by opening the associated /proc file.

The next step is to establish the ID mapping in the newly created user namespace; this is done by writing appropriate values to the uid_map and gid_map files in the child process's /proc directory. Once that has been done, the child can just be killed off; the open file descriptor to its user namespace will ensure that it will stay around after the process is gone.

Actually associating the user namespace is done with the mount_setattr() system call, which is also added by this patch set:

    struct mount_attr {
        __u64 attr_set;
        __u64 attr_clr;
        __u64 propagation;
        __u64 userns_fd;
    };

    int mount_setattr(int dfd, const char *path, unsigned int flags,
                      struct mount_attr *attr, size_t attr_size);

The attr_set and attr_clr fields of the mount_attr structure describe the attributes to be set and cleared, respectively; propagation sets the mount's propagation type (shared, private, and so on). Whether the operation affects only the mount indicated by dfd and path or also all mounts underneath it is controlled by the AT_RECURSIVE flag in flags. To add ID mapping to a filesystem, the caller (who must have the CAP_SYS_ADMIN capability in the current patches) should set MOUNT_ATTR_IDMAP in attr_set, and set userns_fd to the file descriptor for the relevant user namespace.
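A call that requests ID mapping might be sketched as below. Since a C library wrapper may not be available, this goes through syscall() directly; the fallback values for the system call number, MOUNT_ATTR_IDMAP, and AT_EMPTY_PATH are taken from mainline kernel headers for common architectures and should be treated as assumptions relative to the patch set discussed here. The structure is a local copy of the one shown above, renamed (hypothetically) to idmap_mount_attr to avoid clashing with system headers, and idmap_mount() is likewise an invented name.

```c
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; values as found in mainline kernels. */
#ifndef SYS_mount_setattr
#define SYS_mount_setattr 442
#endif
#ifndef MOUNT_ATTR_IDMAP
#define MOUNT_ATTR_IDMAP 0x00100000
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif

/* Local copy of struct mount_attr with the same four 64-bit fields. */
struct idmap_mount_attr {
    uint64_t attr_set;
    uint64_t attr_clr;
    uint64_t propagation;
    uint64_t userns_fd;
};

/* Ask the kernel to attach the user namespace behind userns_fd to the
 * mount referred to by mnt_fd. Returns 0 on success, -1 with errno set. */
int idmap_mount(int mnt_fd, int userns_fd)
{
    struct idmap_mount_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.attr_set = MOUNT_ATTR_IDMAP;      /* the attribute to set */
    attr.userns_fd = (uint64_t)userns_fd;  /* namespace holding the ID map */

    /* An empty path plus AT_EMPTY_PATH means "operate on mnt_fd itself". */
    return (int)syscall(SYS_mount_setattr, mnt_fd, "", AT_EMPTY_PATH,
                        &attr, sizeof(attr));
}
```

In mainline kernels the target is expected to be a detached mount obtained from open_tree() with OPEN_TREE_CLONE, which is then attached with move_mount(); whether the patch set described here carries the same restriction is not stated above. On a kernel without the system call, idmap_mount() simply fails with ENOSYS.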

While ID mapping can apparently be set up for any filesystem mount, the feature is expected to be mostly used with bind mounts, which create a new view of an existing filesystem. The cover letter for the patch series gives a number of examples of how this capability could be used. A simple one involves just providing a view of a directory with the files owned by a different user ID. Another creates an identity mapping (so IDs don't change), but that mapping lacks user ID 0, preventing access as root. Filesystems without the concept of user IDs (such as VFAT) can have those IDs grafted onto them with ID-mapped mounts. And so on.

The previous posting of this patch set generated a certain amount of interest. This work seems to have the approval of the VFS developers, which is a significant hurdle that any patch in this area must overcome. So it might just be that a solution to the ID-mapping problem has finally been found and there will be no need for yet another attempt — maybe.

System call to create a new user namespace

Posted Nov 20, 2020 8:27 UTC (Fri) by skissane (subscriber, #38675) (2 responses)

> Code that wants to add ID mapping to a mounted filesystem has to start by creating a new user namespace; this is a bit of a roundabout procedure that is not directly supported by the kernel. In a sample mount-idmapped tool written by Brauner, this task is done by creating a new process within its own user namespace. The child process does nothing but suspend itself with a SIGSTOP signal while the parent creates a reference to the child's user namespace by opening the associated /proc file.

Ideally the kernel would have a "create_user_ns" or "new_user_ns" system call which allocates a fresh user_ns and returns a file descriptor to its /proc directory, without any of these shenanigans required. Then openat could be used to get at the uid_map and gid_map files.

System call to create a new user namespace

Posted Nov 20, 2020 14:32 UTC (Fri) by c5h5n5o (guest, #128645) (1 response)

Can that still work without having procfs mounted?

System call to create a new user namespace

Posted Nov 20, 2020 19:59 UTC (Fri) by mathstuf (subscriber, #69389)

Why not a pidfd? Can procfs info be retrieved via ioctl calls on them?

ID mapping for mounted filesystems

Posted Nov 21, 2020 15:50 UTC (Sat) by brauner (subscriber, #109349)

> These user namespaces will, most likely, never have processes running within them; in a sense, much of their functionality is wasted in this context. But this approach made it possible to use all of the existing ID-mapping helpers, while creating a more focused ID-mapping abstraction would require duplicating much of that functionality.

Another motivation was that whenever a filesystem is shared with an unprivileged container from the host, or between unprivileged and privileged containers, the user namespace of the container will usually be attached to the vfsmount. So in a lot of cases there's no additional namespace; one is needed only when a complicated custom mapping is required. In other use cases, where systemd maps a host mount to another set of IDs, it would just allocate a single namespace for the logged-in user with the IDs it allows to be remapped (that can be up to 340 individual mappings or ranges) and can then mark whatever it wants.

ID mapping for mounted filesystems

Posted Nov 26, 2020 9:41 UTC (Thu) by NYKevin (subscriber, #129325)

> the caller (who must have the CAP_SYS_ADMIN capability in the current patches)

So, I take it they never got around to formalizing a "No more adding stuff to CAP_SYS_ADMIN" rule?

ID mapping for mounted filesystems

Posted Nov 30, 2020 12:32 UTC (Mon) by motiejus (subscriber, #92837) (1 response)

I am so excited about this.

I am running a few applications on my home server which have access to the same set of personal files (`/data`):

- sshfs: everything.
- syncthing: mostly everything.
- rslsync: photos of a family member who uses iOS (there is no syncthing on iOS).

Now each application will have its own uid, with "real" isolation from files it does not need. E.g. rslsync will have access only to photos of the family member, and will be restricted in all the other ways I can think of.

ID mapping for mounted filesystems

Posted Jan 11, 2021 19:02 UTC (Mon) by immibis (subscriber, #105511)

You don't need this feature in order for each application to have a different UID. This feature is for when you have a drive with files owned by root (or any UID), but you want the kernel to translate that so it looks like they're owned by someone else (a different UID).

E.g. mounting someone else's ext4 USB stick so that you can access the files without sudo.

ID mapping for mounted filesystems

Posted Dec 27, 2021 11:03 UTC (Mon) by simlo (guest, #10866)

In my previous job I tried to build identical development environments via containers. One of the requirements I came up with was that it should run on almost all x86-64 Linux installations without having a sysadmin around. With user namespaces I succeeded with Bubblewrap, which mounted a Fedora base image as rootfs and ran a lot of dnf install commands.

But I had to do hacks: every service user had to have uid 0, which was mapped to the developer's uid in the image creation process. It had to be added to users and groups before installing packages.
The way rootless docker and podman do it, with remapping of users, requires an admin to set up uids belonging to the individual developer to be used for remapping root and service accounts.

It would be very practical to simply be able to map all to one...