Waking systems from suspend

March 2, 2011

This article was contributed by John Stultz

While the power consumption of an idle Linux system has been reduced greatly over the past few years, even more power can be saved by suspending or hibernating the system. Resume times have also gone down, increasing the usability of suspending a laptop even if you're just walking down the hallway to a meeting. And while suspend and hibernation were once features only found on portable devices like laptops, they have over the years become common on mobile embedded devices and non-portable desktops and servers. The power-saving benefits of suspend and hibernate come from the fact that most or all of the hardware is shut down, but this can be a limitation if you're expecting some functionality out of the system. It's the same reason sleeping at your desk is usually frowned upon.

But suppose you were an extraordinary cat-napper with some downtime between the numerous kernel compiles of a long git bisect. You could make it work, but first you would need a good alarm clock. The same can be said of computers.

The RTC

The RTC (Real Time Clock) is a fairly minor bit of hardware on your computer. It usually keeps track of the wall-clock time while the system is off or suspended. It can also be used to generate interrupts in a number of different modes (periodic, one-shot alarm, etc.). This is all fairly normal functionality for a hardware timer device. But one of the most interesting features that most modern RTCs support is that an alarm interrupt can be generated even when the system is suspended (or, on some hardware, hibernating), forcing the machine to wake up.

On Linux the RTC is exposed to user space via the generic RTC driver infrastructure, which creates sysfs entries and a character device which can be used to set hardware alarms, change the interrupt mode, etc. A few applications out there make use of this interface, such as MythTV DVRs, which can trigger alarms so that media computers can be suspended until the start of a TV show that needs to be recorded.

The exposed interface is very much a low-level driver interface, where the values written by the application are sent directly to the hardware. This is a limitation, as it makes it so only one application at a time can program alarm events to an RTC device. For instance, with only a single RTC device, you can't have your system wake up for a nightly backup and also have it wake up to record your favorite show, unless you have some sort of centralized process managing the wakeups on behalf of other applications. Tutorials such as this one illustrate how complex and limiting this interface can be.
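To make the single-slot nature of this interface concrete, here is a minimal Python sketch that programs a wakeup through the RTC's sysfs wakealarm attribute. The path /sys/class/rtc/rtc0/wakealarm, the helper name set_wakealarm(), and the idea that the RTC is kept in UTC (so that system time can be written directly) are assumptions made for illustration; real systems may differ.

```python
import time

# Assumed path: the first RTC's sysfs wakeup attribute on a typical Linux system.
WAKEALARM = "/sys/class/rtc/rtc0/wakealarm"


def set_wakealarm(seconds_from_now, path=WAKEALARM, now=None):
    """Program one wakeup; any alarm set earlier by anyone else is lost."""
    if now is None:
        now = time.time()  # assumes the RTC runs on UTC, like system time
    wake_at = int(now) + int(seconds_from_now)
    # Clear whatever alarm is currently programmed...
    with open(path, "w") as f:
        f.write("0\n")
    # ...then write the new expiry, in seconds since the epoch.
    with open(path, "w") as f:
        f.write(f"{wake_at}\n")
    return wake_at
```

If a backup tool and a DVR both called a helper like this, whichever ran last would silently replace the other's wakeup, which is exactly the sharing problem described above.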

One way to overcome these limitations is to allow the kernel to manage a list of events and have it program the RTC so the alarm will trigger for the earliest event in the list. This avoids the need for user space applications to coordinate in order to share the hardware. To make this sharing possible, a generic "timerqueue" abstraction has been created to manage a simple list of timers that could then be shared with other areas of the kernel, like the high-resolution timers subsystem, that also have to manage timer events. This code was merged for 2.6.38.

The next step is to rework the RTC code so that, when an alarm is set via the character device ioctl() or sysfs interface, an rtc_timer event is created and enqueued into the per-RTC timerqueue instead of directly programming the hardware. The kernel then sets the hardware timer to fire for the earliest event in the queue. In effect, this mechanism virtualizes the RTC hardware, preserving the behavior of the existing hardware-oriented interfaces, while allowing the kernel to multiplex other events using the RTC.
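The multiplexing idea can be modeled in a few lines of Python. This is only a toy sketch, not the kernel's code: the class name VirtualRtc, its methods, and the hw_alarm attribute standing in for the hardware alarm register are all invented for illustration, and Python's heapq is swapped in for the kernel's timerqueue.

```python
import heapq
import itertools


class VirtualRtc:
    """Toy model of a per-RTC timer queue; not the kernel implementation."""

    def __init__(self):
        self._queue = []               # min-heap of (expiry, seq, name)
        self._seq = itertools.count()  # tie-breaker for equal expiries
        self.hw_alarm = None           # stand-in for the hardware alarm register

    def _program_hardware(self):
        # The hardware only ever holds the earliest pending expiry.
        self.hw_alarm = self._queue[0][0] if self._queue else None

    def enqueue(self, expiry, name):
        """Add an event and re-arm the hardware if it is now the earliest."""
        heapq.heappush(self._queue, (expiry, next(self._seq), name))
        self._program_hardware()

    def alarm_interrupt(self, now):
        """On an alarm, pop every expired event, then re-arm for the next one."""
        fired = []
        while self._queue and self._queue[0][0] <= now:
            fired.append(heapq.heappop(self._queue)[2])
        self._program_hardware()
        return fired


rtc = VirtualRtc()
rtc.enqueue(7200, "nightly backup")
rtc.enqueue(3600, "record show")
print(rtc.hw_alarm)               # the earlier of the two expiries
print(rtc.alarm_interrupt(3600))  # events due at or before t=3600
print(rtc.hw_alarm)               # hardware re-armed for the remaining event
```

Both wakeups survive here because each caller only adds to the queue; the single hardware alarm is owned by the queue itself.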

The question now becomes how to expose this new functionality so it can be used.

CLOCK_RTC

The first approach tried was exporting the new RTC functionality to user space directly via the POSIX clocks and timers interface. With this approach, there is a "clockid" assigned to each RTC device, so a user space application can use the POSIX interfaces to access the RTC. In this approach, clock_gettime() returns the current RTC time, clock_settime() sets the RTC time, and timer_settime() sets a POSIX timer to expire when the RTC reaches the desired time.

This approach is the most straightforward method of exposing the RTC, but it does have some disadvantages. Specifically, the RTC and system time may not be the same. On many systems, the RTC is set to local time rather than universal time. Thus, applications would need to make the extra effort to read the RTC and add to that value the time between now and when they want the timer to fire. Also, the RTC, due to simple clock skew, may not increase at the exact same rate as the system time. Additionally, since there may be multiple RTCs on a system, a single static CLOCK_RTC clockid would not be sufficient. Some form of dynamic clockid registration is needed in order to export multiple clockids for multiple RTC devices. This functionality is desired for exposing other hardware clocks via the POSIX interface, and it is currently a work-in-progress by Richard Cochran.
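The extra arithmetic an application would have to do under this approach can be sketched as follows. The function name rtc_expiry() and its parameters are hypothetical, all values are in seconds, and clock skew between the two clocks is deliberately ignored.

```python
def rtc_expiry(rtc_now, system_now, system_target):
    """Translate a system-time deadline into the RTC's time domain."""
    delay = system_target - system_now
    if delay < 0:
        raise ValueError("target is in the past")
    # The RTC may be offset from system time (for example, set to local
    # time), so only the delay carries over, not the absolute value.
    return rtc_now + delay


# Example: an RTC set to a local time eight hours behind UTC system time.
system_now = 1299024000
rtc_now = system_now - 8 * 3600
print(rtc_expiry(rtc_now, system_now, system_now + 600))
```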

Android Alarm Timers

Interestingly, the developers who have been working on Android have extended the RTC to be more useful as well. After all, smartphones are optimized to save power, so they try to stay in suspend as much as possible. But smartphones still have to wake up to do things like notify the user of calendar events or to check for email. In order to do this, the Android team introduced a concept called Android Alarm Timers. These timers use a hybrid approach: when the system is running, alarm timers trigger a high-res timer to fire when an event is supposed to run; however, when the system goes into suspend, the alarm timers code looks at the list of events and sets the RTC to fire an alarm when the earliest event is to run. This avoids making applications deal with the (possibly unsynchronized) RTC time domain and allows applications to simply set timers and have them fire when expected, whether or not the system is suspended.
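The hybrid behavior can be modeled roughly as below. This is a toy sketch under stated assumptions and not Android's code: the class HybridAlarmTimers and the hrtimer and rtc_alarm attributes (standing in for an armed high-res timer and the RTC alarm register) are invented for illustration.

```python
import heapq


class HybridAlarmTimers:
    """Toy model of hybrid alarm timers; events are kept in system time."""

    def __init__(self):
        self._events = []      # min-heap of (expiry, name)
        self.hrtimer = None    # expiry armed while the system is running
        self.rtc_alarm = None  # RTC-domain expiry armed while suspended

    def _earliest(self):
        return self._events[0][0] if self._events else None

    def add(self, expiry, name):
        """While running, a high-res timer tracks the earliest event."""
        heapq.heappush(self._events, (expiry, name))
        self.hrtimer = self._earliest()

    def suspend(self, system_now, rtc_now):
        """On suspend, hand the earliest event over to the RTC."""
        earliest = self._earliest()
        self.hrtimer = None
        if earliest is None:
            self.rtc_alarm = None
        else:
            # Only here is the event translated into the RTC's time domain.
            self.rtc_alarm = rtc_now + (earliest - system_now)

    def resume(self):
        """On resume, go back to the high-res timer."""
        self.rtc_alarm = None
        self.hrtimer = self._earliest()
```

The point of the design is that callers of add() only ever see system time; the RTC time domain appears in one place, inside suspend().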

Although it has never been submitted to the kernel mailing list for inclusion, the Android Alarm Timers implementation would likely meet some resistance from the kernel community. For instance, the user-space interface for applications to use the Android Alarm Timers is via ioctl() to a new special character device (/dev/alarm) instead of using existing system call interfaces. Additionally, the ioctl() interface introduces new names for existing concepts in the kernel, duplicating CLOCK_REALTIME (which provides UTC wall time) and CLOCK_MONOTONIC (which counts from zero starting at system boot, and is not modified by settimeofday() calls) via the names ANDROID_ALARM_RTC and ANDROID_ALARM_SYSTEMTIME, respectively.

The Android Alarm Timers interface does introduce some new useful concepts. For instance, the CLOCK_MONOTONIC clock does not increment during suspend. This is reasonable behavior when you want suspend to be transparent to applications, but when the system spends the majority of its time in suspend and you want to schedule events that wake the system up, having only CLOCK_REALTIME increment over suspend can be limiting. So Android Alarm Timers introduces the ANDROID_ALARM_ELAPSED_REALTIME clock, which is similar to CLOCK_MONOTONIC, but includes time spent in suspend. But again, it is only introduced via an ioctl() to their special character device, and is not exposed via any other standard timekeeping interface.

POSIX Alarm Timers

All in all, the Android Alarm Timers are a very interesting use case, and others in the community have suggested a similar hybrid approach. Inspired by the Android Alarm Timers, I implemented a similar hybrid alarm timers infrastructure on top of the previously-described work virtualizing the RTC interface. However, these timers are exposed to user space via the standard POSIX clocks and timers interface, using the new CLOCK_REALTIME_ALARM clockid. The new clockid behaves identically to CLOCK_REALTIME except that timers set against the _ALARM clockid will wake the system if it is suspended. Additionally, because it's built upon the virtualized rtc_timers work, this implementation doesn't prohibit applications from making use of the existing legacy RTC interfaces. This gives us all the benefits of Android Alarm Timers, such as not forcing applications to deal with the RTC time domain, while making better use of existing kernel interfaces.
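As a rough sketch of how an application might use such a clockid through an existing POSIX call, the following Python code invokes clock_nanosleep() via ctypes. Several things here are assumptions rather than part of the work described above: the helper name sleep_on_clock(), the numeric value 8 for CLOCK_REALTIME_ALARM (taken from later Linux headers), a 64-bit Linux C library whose struct timespec is two longs, and a kernel that accepts the alarm clockid for clock_nanosleep() at all, possibly only with extra privileges.

```python
import ctypes
import os

CLOCK_MONOTONIC = 1
CLOCK_REALTIME_ALARM = 8  # assumed value; check <linux/time.h> on your system


class Timespec(ctypes.Structure):
    # Assumes time_t and long are the same size (typical on 64-bit Linux).
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]


_libc = ctypes.CDLL(None)
_libc.clock_nanosleep.argtypes = [
    ctypes.c_int,
    ctypes.c_int,
    ctypes.POINTER(Timespec),
    ctypes.POINTER(Timespec),
]
_libc.clock_nanosleep.restype = ctypes.c_int


def sleep_on_clock(clockid, seconds):
    """Relative sleep against the given clock; raises OSError on failure."""
    whole = int(seconds)
    req = Timespec(whole, int((seconds - whole) * 1_000_000_000))
    # clock_nanosleep() returns the error number directly instead of
    # setting errno.
    err = _libc.clock_nanosleep(clockid, 0, ctypes.byref(req), None)
    if err:
        raise OSError(err, os.strerror(err))
```

Calling sleep_on_clock(CLOCK_MONOTONIC, 0.5) is an ordinary sleep. Passing CLOCK_REALTIME_ALARM instead is what would, under the assumptions above, let the sleep end by waking a suspended machine.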

The code that implements the timerqueues and reworks the generic RTC layer to allow for multiplexing of events has been included in the 2.6.38 kernel release. The POSIX alarm timers layer will likely need additional review and discussion, in hopes of making sure the Android developers are able to assess compatibility issues in the design. For instance, I've proposed a new POSIX clock (CLOCK_BOOTTIME, along with a corresponding CLOCK_BOOTTIME_ALARM id) which would provide the incrementing-in-suspend value that the Android developers introduced with ANDROID_ALARM_ELAPSED_REALTIME. Also, while not likely to be included into mainline, Android's wakelocks have some interesting semantics with regard to their alarm timer interface. These semantics are not easily satisfied by the POSIX timers interface, but it remains to be determined whether we can get equivalent functionality using modified semantics and the mainline kernel's pm_wakeup interface.
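On systems where both the kernel and Python's time module expose a CLOCK_BOOTTIME clockid, the difference between the two clocks can be observed directly. The sketch below is illustrative only; the function name suspend_time_seconds() is invented, and the result is a rough estimate because the two clocks are read one after the other.

```python
import time


def suspend_time_seconds():
    """Estimate the time spent in suspend since boot, or None if unknown."""
    boottime_id = getattr(time, "CLOCK_BOOTTIME", None)
    if boottime_id is None:
        return None  # platform or Python build without CLOCK_BOOTTIME
    # CLOCK_MONOTONIC stops during suspend; CLOCK_BOOTTIME keeps counting.
    monotonic = time.clock_gettime(time.CLOCK_MONOTONIC)
    boottime = time.clock_gettime(boottime_id)
    return boottime - monotonic


print(suspend_time_seconds())
```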

Other open questions that need to be addressed are:

What capabilities should applications be required to have in order to set POSIX alarm timers?
In order to avoid systems waking up at inappropriate times (think laptop in a bag in the overhead compartment), should there be additional policy layers added so that user-generated suspends (like closing a laptop) inhibit POSIX alarm timers?

I also can imagine some interesting future work combining this functionality with the "Wake on Directed Packet" feature of some new network cards, which wake the system up any time a packet is sent to it. This feature could be used to allow web servers to function normally, servicing requests and running jobs, while suspending and saving power during longer idle periods.

While I might not be able to sleep on the job, I look forward to my desktop system being able to snooze and save electricity while knowing that cron jobs like nightly backups, downloading package updates or running updatedb will still be done.


Waking systems from suspend

Posted Mar 3, 2011 5:37 UTC (Thu) by josh (subscriber, #17465) [Link] (5 responses)

Eventually, if we can ever get systems to suspend and resume in tens of milliseconds, and have some additional control over what devices remain powered (such as not turning off the screen), we can finally start suspending systems between interrupts when no processes need to run, rather than just putting the CPU to sleep.

Waking systems from suspend

Posted Mar 3, 2011 7:11 UTC (Thu) by jstultz (subscriber, #212) [Link] (1 responses)

Yes, the OLPC folks actually use a very similar strategy: http://wiki.laptop.org/go/Suspend_and_resume

That said, the suggestion isn't that far away from some of the run-time power-management approaches being worked on, where unused hardware is shut down, and scheduling policies regulate what can execute while the system is "running". This would ideally, given proper hardware support, allow the same power-efficiencies as suspend when the system is idle.

I think both approaches are important. Runtime power-management makes the most of hardware features to save power while minimally impacting system latencies. Suspend power-management further restricts what the system will respond to, but possibly at the cost of greatly increased latencies.

If suspend/resume gets fast enough, and run-time power-management features in hardware get good enough, the two approaches might converge. And choosing which to use at that point might be a wash. But for now, I think it's important that we chip away at the power-saving problem from both sides.

Waking systems from suspend

Posted Mar 3, 2011 8:46 UTC (Thu) by josh (subscriber, #17465) [Link]

I agree with you that we would ideally like every individual component to support low-power or no-power states equivalent to those used by suspend. If all devices supported that, I don't think we'd need suspend at all. You'd just have a set of independent policy decisions like "do we want to keep the network card alive to maintain a link and provide interrupts when we get packets".

Waking systems from suspend

Posted Mar 3, 2011 9:20 UTC (Thu) by tpetazzoni (subscriber, #53127) [Link] (2 responses)

This is something I implemented on a Blackfin system (Blackfin is a DSP architecture that runs the Linux kernel), which can enter suspend and get out of suspend in about 3-4 milliseconds.

So I modified the idle loop of the kernel so that if the next timer expiration event is far enough away in the future (say, 20 or 30ms), then I program the RTC to wake up a few milliseconds before the scheduled expiration and then enter suspend. When I come back from suspend, I tweak the clocksource to make the system think that the time has continued to pass while the system was suspended, which has the effect of rescheduling the timer events so that the interrupt fires at the correct expected date, as if the system didn't enter suspend. It has been checked with an external scope that looks at GPIO toggling triggered by userspace Linux timers *and* the CPU voltage, and the frequency of the GPIO toggling was correct, and the CPU was completely off during the waiting periods.

I feel that suspend and idle are considered as two very separate things by many kernel developers, because on x86, suspend is such a slow operation that it can only be started explicitly by the user. But on many embedded architectures, suspend can just be a specific type of idle state.

Waking systems from suspend

Posted Mar 3, 2011 9:39 UTC (Thu) by johill (subscriber, #25196) [Link] (1 responses)

So, now we're curious: How much power did that save over just being idle? I think Blackfin is pretty low-power already?

Waking systems from suspend

Posted Mar 7, 2011 7:56 UTC (Mon) by tpetazzoni (subscriber, #53127) [Link]

I don't have the numbers anymore, but yes the difference was quite huge between idle and off. At least sufficient for motivating a fairly huge amount of work to get off-while-idle working on this platform.

Waking systems from suspend

Posted Mar 3, 2011 9:08 UTC (Thu) by johill (subscriber, #25196) [Link] (8 responses)

What I'm wondering about -- are there good interfaces to know why the system woke up? If there aren't, then the cron thing might not work all that well on a standard desktop since you wouldn't want it to go to sleep again when you woke it up by pressing the button, rather than by a timed trigger.

It seems to me that there will need to be some good management around these "why did we wake up" and "should we go to sleep again" stories.

Waking systems from suspend

Posted Mar 3, 2011 16:49 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (6 responses)

Are there any good interfaces? Well, yes there are, but unfortunately, different groups have wildly conflicting definitions of what constitutes “good”. The Android developers swear by wakelocks, but wakelocks have not been warmly received by many in the Linux kernel community (see here, here, here, here, here, here, here, and here). Many in the Linux kernel community swear by extensions to PM QOS, but the Android developers do not believe that these extensions are capable of replacing wakelocks — though these extensions might allow Android device drivers to be accepted into mainline, which would be very worthwhile in and of itself.

Hey, you asked!!! ;–)

Waking systems from suspend

Posted Mar 3, 2011 17:11 UTC (Thu) by johill (subscriber, #25196) [Link] (1 responses)

:-) I'm genuinely interested since we'll eventually run into more of this on more platforms, say a tablet.

The APIs you mention, do they actually allow you to know what device woke up the system? I thought they were mostly for not going to suspend.

Waking systems from suspend

Posted Mar 3, 2011 17:52 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

The trick used is a handoff of responsibility from the kernel to userspace. Userspace knows what device woke up the system because it just interacted with the corresponding device driver, and can then decide whether or not to continue holding off suspend. The kernel therefore only needs to hold off suspend until userspace has started the interaction, where this interaction might be a read() system call. So userspace would hold off suspend just before doing the read(). The kernel would stop holding off suspend as part of the read(). After the read() returned, userspace would use the data read to determine whether it should keep holding off suspend, and, if not, stop holding off suspend.

According to the Android developers, wakelocks handle this handoff in a natural way. Others argue that all of the suspend-blocking work should happen in user space, so that the kernel does not need to worry about it. And there are probably a large number of other opinions out there about how all of this should work, both informed and otherwise. ;-)

Waking systems from suspend

Posted Mar 5, 2011 7:48 UTC (Sat) by swetland (guest, #63414) [Link] (3 responses)

We (the Android Kernel Team) expect that we will be able to move to the driver level API side of the wakelock/suspendblock work to shift our driver work to be mainline-friendly.

The last proposals I've seen around userspace interface and suspend interface involve polling/spinning and I doubt we'll make use of them. The nice thing is this is pretty localized so it'll be easy enough to maintain patches for a reasonable userland interface on our side and not have to worry about any special changes for drivers. A nice step forward.

Regarding timer interfaces, are there non-signals-based ways of interacting with posix timers? Being able to select() on an fd for an alarm event works very well for us now and we're not terribly keen on moving to a signals-based universe.

Waking systems from suspend

Posted Mar 5, 2011 23:34 UTC (Sat) by jstultz (subscriber, #212) [Link] (2 responses)

clock_nanosleep() might be one possibility?

Although from my brief discussions with Arve it sounded like the semantics of the android alarm timer interface are a little particular (especially with regard to wakelocks), so I doubt there will be an alternate implementation that will provide an exact 1:1 mapping.

That said, the rework of the RTC layer as well as the implementation of CLOCK_BOOTTIME (tglx just pulled it into -tip) will hopefully greatly simplify the android alarm timer code. So there may be a future for both the posix alarm timers and some form of the /dev/alarm device to co-exist, sharing a good bit of code.

Even so, while I think the posix interface for alarm timers provides a fairly nice and consistent interface that application developers are used to using, I'd greatly appreciate feedback and suggestions for alternative interfaces. Maybe we need something like clock_select()?

Waking systems from suspend

Posted Mar 7, 2011 14:39 UTC (Mon) by alonz (subscriber, #815) [Link] (1 responses)

Wouldn't it be better to just enhance timerfd() so it can take these new types of clock IDs as well as CLOCK_REALTIME / CLOCK_MONOTONIC?

Waking systems from suspend

Posted Mar 7, 2011 18:16 UTC (Mon) by jstultz (subscriber, #212) [Link]

That might be a good approach. I'll look into it!

Waking systems from suspend

Posted Mar 3, 2011 18:54 UTC (Thu) by jstultz (subscriber, #212) [Link]

So, I think for future debugging, the question of "why a system woke up" is quite useful. It's possible something similar to /proc/timer_list would be wanted so analysis applications similar to powertop can monitor why we are waking up.

But I think the "why a system woke up" question is probably less important than the decision of "when should it go back to sleep?".

Currently suspend is initiated by userland. So it's outside of the kernel's scope for the moment.

But I can see one potential issue where maybe there's a system that will suspend after 15 minutes of X-idle. An alarm timer fires, and then wakes the system to do some work, but that work takes longer than 15 minutes, so the work gets cut off when the system suspends again.

You could have the application contact the power-management daemon and inhibit suspend while the critical work was being done (much like how slideshow or movie applications do this). This would probably work for the most part, but there are still some races between the wakeup and the userland inhibit message being sent.

And indeed, as Paul mentioned there are a number of different approaches being worked with here, with wakelocks being the most contentious, but also most complete.

Waking systems from suspend

Posted Mar 4, 2011 18:41 UTC (Fri) by ofeeley (guest, #36105) [Link]

Nice to see some useful contributions from the android cousins!

Waking systems from suspend

Posted Mar 18, 2011 3:49 UTC (Fri) by kevinm (guest, #69913) [Link]

Stopping CLOCK_MONOTONIC during suspend was a mistake - it should always have had the semantics described for CLOCK_BOOTTIME. POSIX is *very* clear on this:

If the Monotonic Clock option is supported, all implementations shall support a clock_id of CLOCK_MONOTONIC defined in <time.h>. This clock represents the monotonic clock for the system. For this clock, the value returned by clock_gettime() represents the amount of time (in seconds and nanoseconds) since an unspecified point in the past (for example, system start-up time, or the Epoch). This point does not change after system start-up time. The value of the CLOCK_MONOTONIC clock cannot be set via clock_settime(). This function shall fail if it is invoked with a clock_id argument of CLOCK_MONOTONIC.

The most common use of CLOCK_MONOTONIC is to determine the time that has elapsed between two events. With the current Linux CLOCK_MONOTONIC, this will give an incorrect result if system suspend happens between the two events.

MythTV

Posted Mar 30, 2011 12:18 UTC (Wed) by pepsiman (guest, #22382) [Link]

> A few applications out there make use of this interface, such as MythTV
> DVRs, which can trigger alarms so that media computers can be suspended
> until the start of a TV show that needs to be recorded.

MythTV can be configured to suspend, but it normally does a shutdown.