Linux driver: Creating device files for a hotpluggable device
Introduction
Most device drivers hook onto an already existing subsystem: Audio, networking, I2C, whatever. If it’s a specialized piece of hardware, a single device file is typically enough, and you get away with a misc device or, as shown in the kernel’s USB example (drivers/usb/usb-skeleton.c), with the USB framework’s usb_register_dev() call.
However in some cases, a number of dedicated device files are required, belonging to a dedicated class. Only in these cases does it make sense to generate the device files explicitly. A driver that does this for no good reason will be hard to get into the Linux kernel tree.
Hotpluggable devices are particularly tricky, for the following two reasons:
- They may vanish at any time, so the driver must be sure to handle that properly. Namely, it must not free any resources as long as they may be accessed by some other thread.
- It’s impossible to initialize the device driver once and for all against a known set of devices when the driver loads.
This post focuses on USB devices, and in particular on deleting the device files and releasing the kernel structures that support them when the USB device is unplugged.
Spoiler: There’s no problem deleting the device files themselves, but handling the cdev struct requires some attention.
Everything in this post relates to Linux kernel v5.3.
A reminder on device files
This post takes the down-to-details approach to allocating majors and minors, which isn’t necessarily recommended: It’s by far easier to use register_chrdev() than to muck about with alloc_chrdev_region() and setting up the cdev struct directly. However the good old Linux Device Drivers book suggests using the latter, and deems register_chrdev() to be “The Older Way”, which is about to be removed from the kernel. That was the 2005 edition, and fast forward to 2020, no other than drivers/char/misc.c uses register_chrdev(). Some drivers go one way, others go the other.
So first, let’s look at the usual (by the book, literally) way to do it, which is in fact unsuitable for a hotpluggable device. But we have to start with something.
There are three players in this game:
- The cdev struct: Makes the connection between a set of major/minors and a pointer to a struct file_operations, containing the pointers to the functions (methods) that implement open, release, read, write etc. The fops, in short.
- The device files: Those files in /dev that are accessible by user space programs.
- The class: Something that must be assigned to device files that are created, and assists in identifying their nature (in particular for the purpose of udev).
The class is typically a global variable in the driver module, and is created in its init routine:
static struct class *example_class;

static int __init example_init(void)
{
	example_class = class_create(THIS_MODULE, examplename);

	if (IS_ERR(example_class))
		return PTR_ERR(example_class);

	return 0;
}
Because a class is typically global in the module, and hence not accessible elsewhere, it’s impossible to create device files on behalf of another class without using their API and restrictions (for example, misc devices). On the other hand, if you try to push a driver which creates a new class into the kernel tree, odds are that you’ll have to explain why you need to add yet another class. Don’t expect a lot of sympathy on this matter.
The next player is the cdev struct. Its role is to connect between a major + a range of minors and file operations. It’s typically part of a larger struct which is allocated for each physical device. So it usually goes something like
struct example_device {
	struct device *dev;
	struct cdev cdev;
	int major;
	int lowest_minor; /* Highest minor = lowest_minor + num_devs - 1 */
	int num_devs;
	struct kref kref;

	/* Just some device related stuff */
	struct list_head my_list;
	void __iomem *registers;
	int fatal_error;
	wait_queue_head_t my_wait;
};
The only parts that are relevant for this post are the struct cdev and the other members listed above the “device related stuff” comment. I left in a few of the others because they often appear in the driver of a memory-mapped I/O device.
Note that the example_device struct contains the cdev struct itself, and not a pointer to it. This is the usual way, but that isn’t the correct way for a USB device. More on that below.
As mentioned above, the purpose of the cdev struct is to bind a major/minor set to a struct of fops. Something like
static const struct file_operations example_fops = {
	.owner = THIS_MODULE,
	.read = example_read,
	.write = example_write,
	.open = example_open,
	.flush = example_flush,
	.release = example_release,
	.llseek = example_llseek,
	.poll = example_poll,
};
The cdev is typically initialized and brought to life with something like
struct example_device *mydevice;
dev_t dev;
int rc, major, minor;

rc = alloc_chrdev_region(&dev, 0, /* minor start */
			 mydevice->num_devs, examplename);
if (rc) {
	dev_warn(mydevice->dev, "Failed to obtain major/minors\n");
	return rc;
}

mydevice->major = major = MAJOR(dev);
mydevice->lowest_minor = minor = MINOR(dev);

cdev_init(&mydevice->cdev, &example_fops);
mydevice->cdev.owner = THIS_MODULE;

/* The same number of minors that was allocated above */
rc = cdev_add(&mydevice->cdev, MKDEV(major, minor),
	      mydevice->num_devs);
if (rc) {
	dev_warn(mydevice->dev, "Failed to add cdev. Aborting.\n");
	goto bummer;
}
So there are a number of steps here: First, a major and a range of minors are allocated with the call to alloc_chrdev_region(), and the result is stored in dev, which is a dev_t.
Then the cdev is initialized and assigned a pointer to the fops struct (i.e. the struct that assigns open, read, write, release etc.). It’s the call to cdev_add() that makes the module “live” as a device file handler by binding the fops to the range of major/minors that was just allocated.
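To make the major/minor bookkeeping a bit more tangible, here is a small user-space sketch of how the kernel packs a major and a minor into a single dev_t: 20 bits for the minor, as defined in include/linux/kdev_t.h. The EX_ prefixed names and ex_devnum() are made up for this illustration. In a driver, the kernel’s own MKDEV(), MAJOR() and MINOR() do this job.

```c
#include <assert.h>
#include <stdint.h>

/* User-space model of the kernel's internal dev_t layout:
 * the major sits in the upper 12 bits, the minor in the lower 20. */
#define EX_MINORBITS 20
#define EX_MINORMASK ((1U << EX_MINORBITS) - 1)

#define EX_MKDEV(ma, mi) (((uint32_t)(ma) << EX_MINORBITS) | (uint32_t)(mi))
#define EX_MAJOR(dev) ((unsigned int)((dev) >> EX_MINORBITS))
#define EX_MINOR(dev) ((unsigned int)((dev) & EX_MINORMASK))

/* Device number of the i-th device file in a range that starts at
 * (major, lowest_minor) and spans num_devs minors. */
static uint32_t ex_devnum(unsigned int major, unsigned int lowest_minor,
			  unsigned int num_devs, unsigned int i)
{
	assert(i < num_devs); /* Highest minor = lowest_minor + num_devs - 1 */
	return EX_MKDEV(major, lowest_minor + i);
}
```

So a cdev that was added with a base device number and a count of num_devs covers exactly the device numbers that ex_devnum() can return.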
If files with the relevant major and minor happen to exist in the file system, they can be used immediately to execute the methods in example_fops. This is however not likely in this case, since the major and minors were allocated dynamically. The very common procedure is hence to create the device files in the driver (which also triggers udev events, if such are defined). So there can be several calls to something like:
device = device_create(example_class, NULL, MKDEV(major, i), NULL, "%s", devname);
Note that this is the only use of example_class. The class has nothing to do with the cdev.
And of course, all this must be reverted when the USB device is unplugged. So we’re finally getting to business.
Is it fine to remove device files on hot-unplugging?
Yes, but remember that the file operation methods may very well be called on behalf of file descriptors that were already opened.
So it’s completely OK to call device_destroy() on a device that is still opened by a process. There is no problem creating device files with the same names again, even while the said process still has the old file opened. It’s exactly like any inode, which remains visible only to the process(es) that hold a file handle on it. A device file is just a file. Remove it, and it’s really gone only when there are no more references to it.
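The same behavior can be seen with a plain file, without any driver involved. The following user-space sketch creates a scratch file, removes its name while the file descriptor is still open, and then reads the data back through that descriptor. The function name and the path in /tmp are arbitrary choices for this illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if data is still readable through an already open file
 * descriptor after the file's name has been removed, 0 otherwise. */
static int readable_after_unlink(void)
{
	static const char msg[] = "still here";
	char buf[sizeof(msg)] = { 0 };
	char path[64];
	int fd, ok = 0;

	/* Hypothetical scratch file; any writable directory will do */
	snprintf(path, sizeof(path), "/tmp/example_%ld", (long)getpid());

	fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return 0;

	if (write(fd, msg, sizeof(msg)) == (ssize_t)sizeof(msg) &&
	    unlink(path) == 0 &&		/* The name is gone ... */
	    access(path, F_OK) != 0 &&
	    lseek(fd, 0, SEEK_SET) == 0 &&
	    read(fd, buf, sizeof(buf)) == (ssize_t)sizeof(msg))
		ok = !memcmp(buf, msg, sizeof(msg)); /* ... the data isn't */

	close(fd);
	return ok;
}
```

The inode and its data go away only when the last file descriptor is closed, which is the close() at the end.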
In fact, for a simple “cat” process that held a deleted device, the entry in /proc went
# ls -l /proc/756/fd
total 0
lrwx------. 1 root root 64 Mar 3 14:12 0 -> /dev/pts/0
lrwx------. 1 root root 64 Mar 3 14:12 1 -> /dev/pts/0
lrwx------. 1 root root 64 Mar 3 14:12 2 -> /dev/pts/0
lr-x------. 1 root root 64 Mar 3 14:12 3 -> /dev/example_03 (deleted)
So no drama here. Really easy.
Also recall that removing the device files doesn’t mean all that much: It’s perfectly possible (albeit weird) to generate extra device files with mknod, and use them regardless. The call to device_destroy() won’t make any difference in this case. It just removes those convenience device files in /dev.
When to release the cdev struct
Or more precisely, the question is when to release the struct that contains the cdev struct. The kernel example’s suggestion (drivers/usb/usb-skeleton.c) is to maintain a reference counter on the enclosing struct (a kref). Then increment the reference count for each file opened, decrement for each file release, and also decrement it when the device is disconnected. This way, the device information (e.g. example_device struct above) sticks around until the device is disconnected and there are no open files. There is also an issue with locking, discussed at the end of this post.
But when cdev is part of this struct, that is not enough. cdev_del(), which is normally called in the device’s disconnect method, disables the accessibility of the fops for opening new file descriptors. But there’s much to the comment from fs/char_dev.c, above the definition of cdev_del(): “This guarantees that cdev device will no longer be able to be opened, however any cdevs already open will remain and their fops will still be callable even after cdev_del returns.”
So what’s the problem, one may ask. The kref keeps the cdev until the last release! (hopefully with proper locking, as discussed at the end of this post)
Well, that’s not good enough: It turns out that the struct cdev is accessed after the fops release method has been called, even for the last open file descriptor.
Namely, the issue is with __fput() (defined in fs/file_table.c), which is the function that calls the fops release method, and does a lot of other things that are related to the release of a file descriptor (getting it off the epoll lists, for example): If the released inode is a character device, it calls cdev_put() with the cdev struct after the release fops method has returned.
Which makes sense, after all. The cdev’s reference count must be reduced sometime, and it can’t be before calling the release, can it?
So cdev_put calls kobject_put() on the cdev’s kobject to reduce its reference count. And then module_put() on the owner of the cdev entry (the owning module, that is) as given in the @owner entry of struct cdev.
Therefore, there’s a nasty OOPS or a kernel panic if the struct cdev is on a memory segment that has been freed. Ironically enough, the call to cdev_put() brings the cdev’s reference count to zero if cdev_del() has been called previously. That, in turn, leads to a call to the kobject’s release method, which is cdev_default_release(). In other words, the oops is caused by the mechanism that is supposed to protect the cdev (and the module) from the exact problem that it ends up creating.
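The order of events is easier to follow with a toy model. This is a user-space sketch, not kernel code: All model_ names are invented, the counters stand in for the kref and for the cdev’s kobject, and a flag is set instead of really freeing memory, so that the late access can be detected rather than crash.

```c
#include <stdbool.h>

/* Toy stand-in for struct example_device with an embedded cdev */
struct model_dev {
	int kref;	/* References to the enclosing struct */
	int cdev_refs;	/* References to the embedded cdev */
	bool freed;	/* Set instead of really calling kfree() */
};

static void model_kref_put(struct model_dev *d)
{
	if (--d->kref == 0)
		d->freed = true; /* The driver's cleanup would kfree() here */
}

static void model_probe(struct model_dev *d)
{
	d->kref = 1;		/* Like kref_init() */
	d->cdev_refs = 1;	/* Like cdev_init() */
	d->freed = false;
}

static void model_open(struct model_dev *d)
{
	d->cdev_refs++;	/* Taken by the kernel when the file is opened */
	d->kref++;	/* Taken by the driver's open method */
}

static void model_disconnect(struct model_dev *d)
{
	d->cdev_refs--;	/* Like cdev_del() */
	model_kref_put(d);
}

/* Mimics the order in __fput(): The release method first, cdev_put()
 * afterwards. Returns true if cdev_put() would touch freed memory. */
static bool model_fput(struct model_dev *d)
{
	model_kref_put(d);	/* The driver's release method */

	if (d->freed)
		return true;	/* cdev_put() on a dead struct: The oops */

	d->cdev_refs--;		/* cdev_put() on a living struct */
	return false;
}
```

If the file is closed before the device is disconnected, model_fput() returns false and all is fine. It’s only the unplug-while-open sequence that ends with cdev_put() working on memory that the release method has just given away.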
Ironic, but also the hint to the solution.
The lame solution
The simplest way is to have the cdev as a static global variable of the relevant module. Is this accepted practice? Most likely, as Greg K-H himself manipulated a whole lot of these in kernel commit 7e7654a. If this went through his hands, who am I to argue.
The downside is that cdev_add() can only be called once for this cdev, and the number of minors is set in that call, so the range of minors must be fixed. This is commonly solved by setting up a pool of minors in the module’s init routine (256 of them in usb-skeleton.c, but there are many other examples).
Even though it’s a common solution in the kernel tree, I always get a slight allergy to this concept. How many times have we heard that “when it was designed, it was considered a lot” thing?
The elegant (but lengthy) solution
As hinted above, register_chrdev() is better used instead of this solution. If you read through the function’s source in fs/char_dev.c, it’s quite apparent that it does pretty much the same as described here. It has a peculiarity, though: It’s hardcoded to allocate exactly 256 minors, in the range of 0-255. alloc_chrdev_region(), on the other hand, allows choosing the number of requested minors. But then, there’s __register_chrdev(), which is exported and allows setting the range accurately, if 256 minors isn’t enough, or the range needs to be set explicitly.
That said, let’s go back to the lengthy-elegant solution. In short: Allocate the cdev dynamically. Instead of
struct example_device {
	struct device *dev;
	struct cdev cdev;
	[ ... ]
};
go
struct example_device {
	struct device *dev;
	struct cdev *cdev;
	[ ... ]
};
so the cdev struct is referred to with a pointer instead. And instead of the call to cdev_init(), go:
mydevice->cdev = cdev_alloc();
if (!mydevice->cdev)
	goto bummer;

mydevice->cdev->ops = &example_fops;
mydevice->cdev->owner = THIS_MODULE;
And from there go ahead as usual. The good part is that there’s no need to free a cdev that has been allocated this way. The kernel frees it automatically when its reference count goes down to zero (it starts at one, of course). So all in all, the kernel counts the references to cdev as files are opened and closed. In particular, it decrements it when cdev_del() is called. So it really vanishes only when it’s not needed anymore.
Note that cdev_init() isn’t called. Calling it would cause a kernel memory leak (which won’t affect the allocation of major and minors, by the way). See “Read the Source” below, which also shows the details on how this solves the problem.
Only note that if cdev_add() fails, the correct unwinding is:
rc = cdev_add(mydevice->cdev, MKDEV(major, minor),
	      mydevice->num_devs);
if (rc) {
	dev_warn(mydevice->dev, "Failed to add cdev. Aborting.\n");
	kobject_put(&mydevice->cdev->kobj);
	goto bummer2;
}
In other words, don’t call cdev_del() if cdev_add() fails. It can’t be deleted if it hasn’t been added. Decrementing its reference count is the reverse operation. This is how it’s done by __register_chrdev(), defined in fs/char_dev.c. That’s where cdev_add() and cdev_del() are defined, so they should know…
Know cdev’s reference count rules
Since the cdev is freed by the kernel, it’s important to know how the kernel maintains its reference count. So these are the rules:
- cdev is assigned a reference count of 1 by the call to cdev_alloc() (by virtue of kobject_init). Same goes for cdev_init(), but that’s irrelevant (see code snippets below).
- cdev’s reference count is not incremented by the call to cdev_add(). So it stays on 1, which is sensible.
- cdev’s reference count is decremented on a call to cdev_del(). This makes sense, even though it kind-of breaks the symmetry with cdev_add(). But the latter takes a free ride on the ref count of cdev_alloc(), so that’s how it comes together.
- A reference increment is done for each opened related file, and decremented on file release.
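These rules can be traced with a small user-space model. It’s only a sketch: The model_ functions are invented stand-ins that return the reference count after each step, and free() plays the role of the kfree() that the kernel does on the last put.

```c
#include <stdlib.h>

/* Toy stand-in for a struct cdev obtained from cdev_alloc() */
struct model_cdev {
	int refcount;
};

static int model_live_cdevs; /* Allocated and not yet freed */

static struct model_cdev *model_cdev_alloc(void)
{
	struct model_cdev *c = calloc(1, sizeof(*c));

	if (c) {
		c->refcount = 1; /* By virtue of kobject_init() */
		model_live_cdevs++;
	}
	return c;
}

/* Drops one reference and frees the struct on the last one.
 * Returns the remaining count. */
static int model_cdev_put(struct model_cdev *c)
{
	int left = --c->refcount;

	if (left == 0) {
		free(c);
		model_live_cdevs--;
	}
	return left;
}

/* cdev_add() doesn't touch the count */
static int model_cdev_add(struct model_cdev *c)
{
	return c->refcount;
}

static int model_file_open(struct model_cdev *c)
{
	return ++c->refcount;
}

static int model_cdev_del(struct model_cdev *c)
{
	return model_cdev_put(c);
}

static int model_file_release(struct model_cdev *c)
{
	return model_cdev_put(c);
}
```

For a device that is unplugged while one file is open, the count goes 1 (alloc), 1 (add), 2 (open), 1 (del) and 0 (release), and only then is the memory freed.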
For the extra pedantic, it may seem necessary to call kobject_get(&mydevice->cdev->kobj) immediately after cdev_alloc(), and then kobject_put() only after freeing the example_device struct, because it contains the pointer to the cdev. This is what reference counting means: Count the pointers to the resource. However since the cdev struct is typically only used for the cdev_del() call, nothing bad is likely to happen because of this pointer to nowhere after the cdev has been freed. It’s more a matter of formality.
This extra reference count manipulation can also be done with cdev_get() and cdev_put(), but will add an unnecessary and possibly confusing (albeit practically harmless) reference count to the module itself. Just be sure to set the cdev’s @owner entry before calling cdev_get() or things will get messy.
Read the Source
Finally, I’ll explain why using cdev_alloc() really helps. The answer lies in the kernel’s fs/char_dev.c.
Let’s start with cdev_init(). It’s short:
void cdev_init(struct cdev *cdev, const struct file_operations *fops)
{
	memset(cdev, 0, sizeof *cdev);
	INIT_LIST_HEAD(&cdev->list);
	kobject_init(&cdev->kobj, &ktype_cdev_default);
	cdev->ops = fops;
}
Noticed that kobject_init()? It initializes a kernel object, which is used for reference counting. And it’s of type ktype_cdev_default, which in this case only means that the release function is defined as
static struct kobj_type ktype_cdev_default = {
	.release = cdev_default_release,
};
So when cdev->kobj’s reference count goes to zero, cdev_default_release() is called. Which is:
static void cdev_default_release(struct kobject *kobj)
{
	struct cdev *p = container_of(kobj, struct cdev, kobj);
	struct kobject *parent = kobj->parent;

	cdev_purge(p);
	kobject_put(parent);
}
Arrgghh! So there’s a release function! Why can’t it free the memory as well? Wouldn’t that have been perfect? Well, no. A catastrophe, in fact: How could it free a memory segment within another enclosing struct?
But in fact, there is such a release function, with a not-so-surprising name:
static void cdev_dynamic_release(struct kobject *kobj)
{
	struct cdev *p = container_of(kobj, struct cdev, kobj);
	struct kobject *parent = kobj->parent;

	cdev_purge(p);
	kfree(p);
	kobject_put(parent);
}
Exactly the same, just with the kfree() in exactly the right spot. Backed up by
static struct kobj_type ktype_cdev_dynamic = {
	.release = cdev_dynamic_release,
};
and guess which function uses it:
struct cdev *cdev_alloc(void)
{
	struct cdev *p = kzalloc(sizeof(struct cdev), GFP_KERNEL);
	if (p) {
		INIT_LIST_HEAD(&p->list);
		kobject_init(&p->kobj, &ktype_cdev_dynamic);
	}
	return p;
}
Now let’s compare it with cdev_init():
- It allocates the cdev instead of using an existing one. Well, that’s the point, isn’t it?
- It doesn’t call memset(), because the segment is already zero by virtue of kzalloc.
- It doesn’t assign cdev->ops, because it doesn’t have that info. The driver is responsible for this now.
- It sets the kernel object to have a release method that includes the kfree() part, of course.
This is why cdev_init() must not be called after cdev_alloc(): Even though it apparently does nothing harmful, it re-initializes the kernel object with ktype_cdev_default. That easily goes unnoticed, since the only thing that happens is that kfree() isn’t called, causing a very small, barely noticeable, kernel memory leak. No disaster, but people go to kernel-hell for less.
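The leak can be reproduced in a user-space model of the two release methods. This is a sketch with invented model_ names: A function pointer plays the role of the kobj_type, and a counter keeps track of heap allocations that haven’t been freed.

```c
#include <stdlib.h>

struct model_cdev;
typedef void (*model_release_fn)(struct model_cdev *);

struct model_cdev {
	int refcount;
	model_release_fn release; /* Plays the role of the kobj_type */
};

static int model_live; /* Heap allocations not yet freed */

/* Like cdev_default_release(): No freeing, the memory isn't ours */
static void model_default_release(struct model_cdev *c)
{
	(void)c;
}

/* Like cdev_dynamic_release(): The same, plus freeing the memory */
static void model_dynamic_release(struct model_cdev *c)
{
	free(c);
	model_live--;
}

static struct model_cdev *model_cdev_alloc(void)
{
	struct model_cdev *c = calloc(1, sizeof(*c));

	if (c) {
		c->refcount = 1;
		c->release = model_dynamic_release;
		model_live++;
	}
	return c;
}

/* Like cdev_init(): Resets the object and selects the default release */
static void model_cdev_init(struct model_cdev *c)
{
	c->refcount = 1;
	c->release = model_default_release;
}

static void model_cdev_put(struct model_cdev *c)
{
	if (--c->refcount == 0)
		c->release(c);
}
```

Calling model_cdev_put() on a freshly allocated object brings model_live back to zero. Calling model_cdev_init() in between silently swaps the release method, and the same put leaves the allocation behind.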
When and how to free example_device
Now back to the topic of maintaining a reference count on the device’s information (e.g. struct example_device). It should contain a struct kref, which keeps track of whether the struct itself must be kept in memory or can be deleted. As mentioned earlier, the kref starts with a reference count of 1 (that’s what kref_init() sets it to), and is then incremented every time the open method is called for a related device file, decremented for every release of such, and once again decremented when the device itself is disconnected.
On the face of it, easy peasy: The struct goes away when there are no related open device files, and the device itself is away too. But what if there’s a race condition? What if a file is opened at the same time that the device is disconnected? This requires a mutex.
The practice for using kref is to decrement the struct’s reference count with something like
kref_put(&mydevice->kref, cleanup_dev);
where cleanup_dev is a function that is called if the reference count reached zero, with a pointer to the kref. The function then uses container_of to find the address of the structure containing the kref, and frees that structure. Something like
static void cleanup_dev(struct kref *kref)
{
	struct example_device *dev =
		container_of(kref, struct example_device, kref);

	kfree(dev);
}
The locking mechanism is relatively simple. All it needs to ensure is that the open method doesn’t try to access the example_device struct after it has been freed. But since the open method must do some kind of lookup to find which example_device struct is relevant, by checking if it covers the major/minor of the opened device file, the name of the game is to unlist the example_device before freeing its memory.
So if the driver implements a list of example_device structs, one for each connected USB device, all that is necessary is to protect the access to this list with a mutex, and to hold that mutex while kref_put() is called. Likewise, this mutex is taken by the open method before looking in the list, and is released only after incrementing the reference count with kref_get().
And then make sure that the list entry is removed in cleanup_dev.
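Here is how this locking scheme may look, as a user-space sketch: A pthread mutex stands in for the kernel’s mutex, a plain integer for the kref, and all model_ names are invented. A real driver would use mutex_lock(), kref_get() and kref_put() with a list_head, but the order of operations is the point here.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

struct model_dev {
	struct model_dev *next; /* List of known devices */
	unsigned int lowest_minor, num_devs;
	int kref;
};

static struct model_dev *model_list;
static pthread_mutex_t model_list_mutex = PTHREAD_MUTEX_INITIALIZER;

static struct model_dev *model_probe(unsigned int lowest_minor,
				     unsigned int num_devs)
{
	struct model_dev *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;

	dev->lowest_minor = lowest_minor;
	dev->num_devs = num_devs;
	dev->kref = 1;

	pthread_mutex_lock(&model_list_mutex);
	dev->next = model_list;
	model_list = dev;
	pthread_mutex_unlock(&model_list_mutex);
	return dev;
}

/* The open method's part: Look up the device by minor and take a
 * reference before releasing the mutex. Returns NULL if not found. */
static struct model_dev *model_open(unsigned int minor)
{
	struct model_dev *dev;

	pthread_mutex_lock(&model_list_mutex);
	for (dev = model_list; dev; dev = dev->next)
		if (minor >= dev->lowest_minor &&
		    minor < dev->lowest_minor + dev->num_devs) {
			dev->kref++;
			break;
		}
	pthread_mutex_unlock(&model_list_mutex);
	return dev;
}

/* For both the release and disconnect methods: Drop a reference with
 * the mutex held. On the last one, unlist the entry and then free it. */
static void model_put(struct model_dev *dev)
{
	struct model_dev **p;

	pthread_mutex_lock(&model_list_mutex);
	if (--dev->kref == 0) {
		for (p = &model_list; *p; p = &(*p)->next)
			if (*p == dev) {
				*p = dev->next;
				break;
			}
		free(dev);
	}
	pthread_mutex_unlock(&model_list_mutex);
}
```

Because the lookup and the reference increment happen under the same mutex that protects the unlisting, the open method can never find an entry that is in the middle of being freed.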
The major / minor space waste
Reading through fs/char_dev.c, one gets the impression that Linus intended to allocate majors and minors in an efficient manner when he wrote it back in 1991, but then it never happened: There’s an explicit implementation of a hash for holding the ranges of majors and minors, and the relevant routines insert entries into it, and remove them as allocations are made and dropped.
But then it seems like alloc_chrdev_region(), which allocates major / minor space dynamically, sets the minor base to zero in its call to __register_chrdev_region(). The latter calls find_dynamic_major(), which as its name implies, looks up a major that isn’t used at all. In no way is there an attempt to re-use a major by subsequent alloc_chrdev_region() calls.
The truth is that there’s no practical reason to. Looking at /proc/devices, there aren’t so many majors allocated on a typical system, so there’s no drive to optimize.
Bonus: When is it OK to access the USB API’s structs?
Not directly related, but definitely worth mentioning: The memory chunk that struct usb_interface *interface points to (the pointer given to both probe and disconnect) is released after the call to the disconnect method returns. This means that if any other method holds a copy of the pointer and uses it, there must be some kind of lock that prevents the disconnect call from returning as long as this pointer may be in use, and of course prevents any other thread from starting to use this pointer after that. Otherwise even something as innocent as
dev_info(&interface->dev, "Everything is going just great!\n");
may cause a nasty crash. Sleeping briefly in the disconnect method is OK, and it solves this issue. Just be sure no other thread sleeps forever with that lock taken. This should not be an issue, because asynchronous operations on the USB API have no reason to block.
This is demonstrated well in the kernel’s own usb-skeleton.c, by virtue of io_mutex. In the disconnection method, it goes
mutex_lock(&dev->io_mutex);
dev->interface = NULL;
mutex_unlock(&dev->io_mutex);
and then, whenever the driver wants to touch anything related to USB, it goes
mutex_lock(&dev->io_mutex);
if (!dev->interface) {
	mutex_unlock(&dev->io_mutex);
	retval = -ENODEV;
	goto error;
}
and keeps holding that mutex during all business with the kernel’s USB API. Once again, this is reasonable when using the asynchronous API, so no call blocks.
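The whole pattern fits in a few lines when modeled in user space. In this sketch a pthread mutex stands in for io_mutex, struct model_iface for struct usb_interface, and the model_ function names are invented. It’s a model of the idea, not of usb-skeleton.c itself.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

struct model_iface {
	int dummy; /* Stands in for struct usb_interface */
};

struct model_dev {
	pthread_mutex_t io_mutex;
	struct model_iface *interface; /* NULL once disconnected */
};

/* The disconnect method can't return while an I/O path holds the
 * mutex, so the interface stays valid for whoever is using it. */
static void model_disconnect(struct model_dev *dev)
{
	pthread_mutex_lock(&dev->io_mutex);
	dev->interface = NULL;
	pthread_mutex_unlock(&dev->io_mutex);
}

/* Any I/O path: Check the pointer and use it under the same mutex */
static int model_do_io(struct model_dev *dev)
{
	int retval = 0;

	pthread_mutex_lock(&dev->io_mutex);
	if (!dev->interface) {
		retval = -ENODEV;
		goto out;
	}

	/* ... use dev->interface here, still holding the mutex ... */

out:
	pthread_mutex_unlock(&dev->io_mutex);
	return retval;
}
```

Before the disconnect, model_do_io() returns 0. After it, every call fails with -ENODEV without ever dereferencing the stale pointer.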
It’s however not possible to hold this mutex in URB completer callbacks, since they are executed in atomic context (an interrupt handler or tasklet). These callback routines are allowed to assume that the interface data is legit throughout their own execution, because by default the kernel’s USB subsystem makes sure to complete all URBs (with a -ENOENT status), and prevent submitting new ones, before calling the device’s disconnect method (for example, in usb-skeleton.c, dev->interface->dev is used for an error message in the completion callbacks).
The soft_unbind flag
This default behavior makes a lot of sense when the device is physically unplugged, but it also applies when the driver is about to be unloaded (e.g. with rmmod). In the latter case, there is no need to cut off communication abruptly, and sometimes it’s desired to wrap up cleanly with some final URBs. To facilitate that, the driver can set the soft_unbind flag, which means “if set to 1, the USB core will not kill URBs and disable endpoints before calling the driver’s disconnect method”. When this flag is set, it’s the driver’s responsibility to make sure there are no outstanding URBs when the disconnect method returns as well as when the probe method returns with error, and that none are queued later on. Or even stricter, there must be no outstanding URBs when dev->interface is nullified before returning. But that’s it. There are no other implications.
It’s worth saying this again: If the probe method returns with error, the USB framework normally kills all outstanding URBs. But it won’t do that if soft_unbind is set (as of kernel v5.12). The fix for this should have been that the framework kills all outstanding URBs after probe returns with an error (as well as when disconnect returns, actually) because any proper driver should have done that anyhow. I would submit a patch, but last time I did that the relevant maintainers played silly (and time consuming) games with me, so I made sure my own driver gets this right and called it a day.
The soft_unbind flag affects the behavior of usb_unbind_interface() (in drivers/usb/core/driver.c), which sets intf->condition to USB_INTERFACE_UNBINDING and then checks soft_unbind. If it’s false, usb_disable_interface() is called to terminate all URBs before calling the disconnect method. This ensures that no new URBs are queued and that the old ones are completed. So once again, it boils down to whether the USB framework kills the URBs before calling the disconnect method, or the driver does the same before returning from it.
usb-skeleton.c is misleading in this matter (as of kernel v5.12): skel_disconnect() calls usb_kill_urb() and usb_kill_anchored_urbs() even though soft_unbind isn’t set. Hence there are no URBs to kill by the time these calls are made, and they do nothing. It’s likewise questionable if setting dev->disconnected to prevent I/O from starting is necessary, but I haven’t dived into that issue.
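The division of labor between the USB framework and the driver can be summarized with a toy model in user space. The struct and the model_ functions are invented for this sketch, and counting URBs is of course a gross simplification of what usb_disable_interface() and usb_kill_urb() really do.

```c
#include <stdbool.h>

struct model_intf {
	bool soft_unbind;	/* The driver's choice */
	int outstanding_urbs;
	int killed_by_core;
	int killed_by_driver;
};

/* The driver's disconnect method: Kill whatever is still in flight */
static void model_disconnect(struct model_intf *intf)
{
	intf->killed_by_driver += intf->outstanding_urbs;
	intf->outstanding_urbs = 0;
}

/* The part of usb_unbind_interface() that soft_unbind affects */
static void model_unbind(struct model_intf *intf)
{
	if (!intf->soft_unbind) {
		/* Stands in for usb_disable_interface() */
		intf->killed_by_core += intf->outstanding_urbs;
		intf->outstanding_urbs = 0;
	}

	model_disconnect(intf);
}
```

Without soft_unbind, the disconnect method finds nothing left to kill, which is the situation in skel_disconnect(). With soft_unbind, it’s the only thing that stands between the outstanding URBs and a released interface.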