Clock Gating: Smart Use Ensures Smart Returns
In current SOC design, clock gating is one of the most basic and effective techniques for saving dynamic power throughout the chip. Clock gating is applied at two different levels of the design flow. At the RT level, you build clock gating into the architecture of the design; this gating switches off the clock to a particular IP depending on whether that IP is active or inactive. At the synthesis stage, the synthesis tool inserts automated clock-gating cells at a finer granularity, according to the clock-gating constraints the user provides. These constraints include the minimum and maximum number of registers in a register bank to be driven by a particular type of clock-gating cell.
This article targets common erroneous practices that designers may fall into while implementing clock gating in SOCs. It details the problems that arise from these errors and the methods to counter them early in the design flow.
Usually designers decide on their clock-gating strategy during synthesis. At this time they must decide on the type of clock-gating cells to be used. A number of clock-gating cells exist in libraries. For example:
Symmetrical and asymmetrical clock-gating cells
Clock-gating cells with a buffered or a nonbuffered clock
Clock-gating cells with different threshold voltages, if you are using multi-Vt cell synthesis
Many of the points here are design-dependent, but ignoring factors such as choice of symmetrical cells, cells with a nonbuffered clock, or the different threshold voltages of cells can be dangerous. A brief
description of each issue follows.
Clock-gating cells sit directly in the clock path, so the best choice is a cell with the same delay for the rising and falling edges of the clock; that is, a symmetrical clock-gating cell.
Also, in most designs, you build the clock tree by balancing the clocks in the same domain with the minimum possible skew. A nonbuffered clock-gating cell is preferable because it has less cell delay than a buffered cell, and you can address transition requirements while fixing design-rule violations. This approach consumes fewer buffers than using buffered clock-gating cells would.
The third factor concerns the choice of threshold voltage for the clock-gating cells. If your design requires a trade-off between leakage power and timing, we recommend analyzing the number of extra clock buffers required to balance the clock with high-Vt clock-gating cells. If the design has multiple levels of clock-gating cells from the clock source to the leaf flops and there is substantial communication between the gated and ungated levels, then a low-Vt cell is the clear choice, as it improves timing and reduces on-chip variation, all without much degradation in power, because low gating-cell latency means fewer clock buffers are needed.
The designer must also think about the minimum and maximum number of flops used per clock-gating cell. The answer to this question is a bit tricky. A clock-gating cell is inserted to reduce the power
consumption. Suppose we put in a clock-gating cell for gating the clock of a module comprising one or two flops. The area overhead and power overhead—both dynamic and leakage—would then be much
more than the power saved. Also, there will be a limit, decided by the drive strength of clock-gating cells, beyond which a large buffer is required to maintain the output slew. So the safest minimum number of
flops you can gate with one clock gate should be either 3 or 4 in a register bank. And usually 32 or 64 flops is the maximum limit.
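To make the break-even point concrete, here is a small back-of-the-envelope Python sketch. The per-flop clock-pin power, the gating-cell overhead, the activity figure, and the fan-out limit are invented for illustration only; real values come from the library and from switching-activity data.

# Back-of-the-envelope sketch of the minimum and maximum limits discussed above.
# All numbers are illustrative assumptions, not library data.

CLK_PIN_POWER_PER_FLOP = 1.0   # clock-pin switching power per flop, arbitrary units
CG_CELL_OVERHEAD       = 1.8   # gating cell's own clock, enable-logic, and leakage cost
GATED_OFF_FRACTION     = 0.5   # assumed fraction of time the enable shuts the clock off
MAX_FANOUT             = 32    # assumed drive-strength limit of one gating cell

def net_saving(n_flops):
    """Net power benefit of gating n_flops with a single clock-gating cell."""
    return n_flops * CLK_PIN_POWER_PER_FLOP * GATED_OFF_FRACTION - CG_CELL_OVERHEAD

def cells_needed(bank_width):
    """Number of gating cells a wide bank needs to honor the fan-out limit."""
    return -(-bank_width // MAX_FANOUT)   # ceiling division

for n in (1, 2, 4, 8, 64):
    print(f"{n:3d} flops: net saving {net_saving(n):+.1f}, gating cells needed {cells_needed(n)}")
# With these assumed numbers, gating pays off from roughly four flops upward,
# and a 64-flop bank needs two gating cells to stay within the fan-out limit.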
The designer must also select the clock-gating test signals. Normal scan testing requires bypassing the clock-gating cells in the design: we do not want any gating of clocks during scan. We also require a shift-enable signal to check the logic that generates the enable of the clock-gating cell. So all the clock-gating cells usually have a test-control signal. In addition, we need to flatten the synthesis-inserted clock-gating cells at the time of logic-equivalence checking between the RTL and the gate-level netlist, because these clock-gating cells do not exist in the RTL. For this we constrain the clock-gating test signals. Because of this requirement, the test-logic designer has to ensure separate test signals for the RTL and the synthesis-inserted clock-gating cells.
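For readers who have never looked inside one, the typical integrated clock-gating cell is a level-sensitive latch, transparent while the clock is low, feeding an AND gate, with the test signal ORed into the functional enable so that the clock is never gated during scan. The Python model below is a behavioral sketch of that standard structure; the signal and class names are our own.

# Behavioral sketch of a latch-based integrated clock-gating (ICG) cell.
# The enable latch is transparent while the clock is low, so the gated clock cannot
# glitch; test_en is ORed in so that scan always sees a free-running clock.

class ClockGatingCell:
    def __init__(self):
        self.latched_en = 0

    def evaluate(self, clk, func_en, test_en):
        effective_en = func_en | test_en      # test signal bypasses the functional gating
        if clk == 0:                          # latch is transparent on clock low
            self.latched_en = effective_en
        return clk & self.latched_en          # gated clock output

cg = ClockGatingCell()
for clk, en, te in [(0, 1, 0), (1, 1, 0), (0, 0, 0), (1, 0, 0), (0, 0, 1), (1, 0, 1)]:
    print(f"clk={clk} en={en} test_en={te} -> gated clk={cg.evaluate(clk, en, te)}")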
Then there is the question of which modules should be gated. The designer decides explicitly on RTL clock gating, whereas at synthesis the tool decides where to insert clock-gating cells on the basis of two factors: switching activity and observable don't-care conditions. But sometimes the designer can judge the exact activity at the architectural level itself. For example, if a particular module is always active and its standby time is almost negligible, then there is no need to insert synthesis-level clock-gating cells in it.
There is a catch here, though. The usual assumption is that if we don't put automated synthesis-level clock-gating cells in a module, we save area, since the cell-instance count of that module decreases. But the savings depend entirely on the type of RTL used. Synthesis tools place clock-gating cells in the design by replacing the muxes in front of the register banks. So when we remove these clock-gating cells, the muxes take back their original place. For each clock-gating cell removed, the number of muxes coming back into the picture equals the number of flops that the clock-gating cell was gating. Hence, we lose both area and power. Further, some critical modules exist (for example, test-compression-logic generation modules or clock-generation modules) in which clock-gating-cell insertion can affect functionality. So the decision about which modules should allow or suppress synthesis clock gating requires careful investigation.
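A quick way to see why suppressing clock gating can backfire on area is simply to count cells: removing one gating cell brings back one recirculation mux per flop it was gating. The Python sketch below uses invented relative area numbers purely to make that comparison visible; real figures come from the standard-cell library.

# Rough area comparison: one clock-gating cell versus per-flop recirculation muxes.
# Relative area units are invented for illustration.

AREA_CG_CELL = 4.0   # one integrated clock-gating cell
AREA_MUX     = 1.5   # one 2-to-1 recirculation mux in front of a flop

def area_delta(n_flops):
    """Area of the mux-based version minus the clock-gated version for one bank."""
    return n_flops * AREA_MUX - AREA_CG_CELL

for n in (2, 4, 8, 16):
    print(f"{n:2d}-flop bank: mux version costs {area_delta(n):+.1f} more area units")
# From about three flops upward the muxes alone cost more area than the gating cell
# they replace, on top of the clock-pin power that the now-ungated flops keep burning.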
Now let's take a look at a couple of short case studies, in which we will describe a common design practice that leads to trouble and suggest an alternative.
By common practice, clock-gating cells have an enable pin on which setup and hold checks are performed with respect to the clock pin. Until we run clock-tree synthesis, the timing violations at these clock-gating cells are usually not visible, because the clock path is treated as ideal. So we in effect assume that the clock at the flop that launches the enable signal is coincident with the clock arriving at the clock-gating cell. But as soon as we build the clocks, the clock-gating violations at these cells pop up. The main reason is that the enable-launching flop's clock is balanced not with the clock going into the gating cell but with the clocks entering the flops in the fan-out of the clock-gating cells. This difference results in a skew whose minimum value equals the delay of a clock-gating cell. With the information added during clock-tree synthesis, new, unoptimized paths become visible.
To avoid this problem, we overconstrain these paths at synthesis time, and we also follow up at the global-physical-synthesis level by putting extra uncertainties or latencies on the clock-gating cells. Hold violations are not critical here anyway, because the skew works in hold's favor.
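One way to write the check down, using symbols of our own choosing rather than anything from the article: let T be the clock period, t_launch and t_cg the clock arrival times at the enable-launching flop and at the gating cell, t_c2q and t_en the flop clock-to-Q and enable-logic delays, t_su the library setup requirement on the enable pin, and d_cg the delay of a clock-gating cell. Then, roughly,

\text{slack}_{setup} \approx (T + t_{cg}) - (t_{launch} + t_{c2q} + t_{en} + t_{su}).

Before clock-tree synthesis both clocks are ideal, so t_{launch} = t_{cg} and the check looks clean. After clock-tree synthesis the launch flop is balanced with the leaf flops, so t_{launch} \approx t_{cg} + d_{cg} at a minimum, and the setup slack degrades by at least d_{cg}. The hold check moves in the opposite direction, which is why hold is not the concern here.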
Here is another example. In our designs we normally implement a hierarchical clock-gating technique that puts multiple levels of clock-gating cells in the design. The multiple levels also exist because clocks
from the various sources (PLL, external oscillator, dividers) get distributed throughout the chip via a clock-distribution network that generates gated and nongated clocks for all individual IPs.
As we mentioned earlier, there are at least two levels of clock-gating insertion in any design. The first is at the RTL stage, at the module level, and the second is at the synthesis stage, in the form of register-bank-level insertion. So in the end we normally end up with three to four levels of clock gating in a design. Every clock-gating cell has around 200 ps of delay in a 90-nm technology, so for a flop at the end of such a chain the clock-gating cells alone contribute approximately 800 ps of clock latency. Now if we have two separate clock domains talking with each other, say the bus clock with the CPU clock, and both have four levels of clock-gating cells, then the clock-gating cells contribute 1600 ps of uncommon path. If we now take a 10% derating factor for on-chip variations on both launch and capture, that adds 320 ps of on-chip-variation margin to register-to-register paths.
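Tallying the numbers quoted above (the 200-ps-per-cell figure at 90 nm is the article's; the arithmetic simply reproduces the stated totals):

4 \times 200\ \text{ps} = 800\ \text{ps of gating-cell latency per clock path}
800\ \text{ps} + 800\ \text{ps} = 1600\ \text{ps of uncommon path between the two domains}
(10\% + 10\%) \times 1600\ \text{ps} = 320\ \text{ps consumed by on-chip-variation derating.}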
We propose restructuring this multilevel clock gating into a single-level clock-gating structure. We can do this easily by ANDing the enables of the clock-gating cells and providing the clock to all the clock-gating cells in a clock path from a single source.
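A minimal functional model of the restructuring, with names of our own choosing: the cascaded version passes the clock through every gating cell in turn, while the flattened version ANDs all the enables and gates the clock once, so only a single cell delay lands in the clock path.

# Functional sketch: cascaded multilevel clock gating versus a flattened single level.
# The gated clock is logically identical in both cases; the difference is that the
# flattened form puts only one gating cell (one cell delay) in the clock path, while
# the AND tree of enables sits in the enable path, where timing is easier to close.

from itertools import product

def gated_clock_cascaded(clk, enables):
    g = clk
    for en in enables:        # each level adds another gating cell to the clock path
        g = g & en
    return g

def gated_clock_flat(clk, enables):
    combined = 1
    for en in enables:        # combine the enables outside the clock path
        combined &= en
    return clk & combined

for clk, e1, e2, e3 in product((0, 1), repeat=4):
    assert gated_clock_cascaded(clk, (e1, e2, e3)) == gated_clock_flat(clk, (e1, e2, e3))
print("Cascaded and flattened gating produce the same gated clock for every input combination.")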
Module-level clock gating can appear intuitively obvious. And there is the temptation to just trust the tool when synthesis tools insert clock gating automatically. But innocence can lead to misfortune. We have
pointed out some important issues that you must consider when you are thinking through your clock-gating strategies, and we have illustrated our points with a couple of real-life examples. We hope this
saves you some time and trouble on your next design.