Guidelines For Design Synthesis Using Synopsys Design Compiler Design Synthesis
Introduction

One of the most important steps in ASIC design is the synthesis phase. Synthesis
is an automatic method of converting a higher level of abstraction to a lower level of
abstraction. In other words, the synthesis process converts Register Transfer Level (RTL)
descriptions to gate-level netlists. These gate-level netlists can be optimized for area,
speed, testability, etc. The synthesis process is shown in Fig 1.0.
[Fig 1.0: The synthesis process, with synthesis producing gate-level netlists.]
The inputs to the synthesis process are an RTL HDL description, circuit constraints
and attributes for the design, and a technology library. The synthesis process produces an
optimized gate-level netlist from all these inputs. Synthesizing a design is an iterative
process and begins with defining the constraints for each block of the design. In addition
to these constraints, a file defining the synthesis environment is also needed. The
environment file specifies the technology cell libraries and other relevant information that
the tool uses during synthesis.
• search_path
  This parameter tells the synthesis tool all the paths that it should search when
  looking for a synthesis technology library for reference during synthesis.
• target_library
  This parameter specifies the file that contains all the logic cells that should be used
  for mapping during synthesis. In other words, during synthesis the tool maps a
  design to the logic cells present in this library.
• symbol_library
  This parameter points to the library that contains the "visual" information on the
  logic cells in the synthesis technology library. All logic cells have a symbolic
  representation, and information about the symbols is stored in this library.
• link_library
  This parameter points to the library that contains information on the logic gates in
  the synthesis technology library. The tool uses this library solely for reference and
  does not use the cells present in it for mapping, as it does with target_library.

An example of the use of these four variables from a .synopsys_dc.setup file is given below.

Pin: It corresponds to the inputs, outputs or IOs of the cells in the design. (Note the
difference between port and pin.)

Net: These are the signal names, i.e., the wires that hook up the design together by
connecting ports to pins and/or pins to each other.

Clock: The port or pin that is identified as a clock source. The identification may be
internal to the library or it may be done using dc_shell commands.

Library: Corresponds to the collection of technology-specific cells that the design is
targeting for synthesis, or linking for reference.

Design Entry

Before synthesis, the design must be entered into the Design Compiler or Design
Analyzer (referred to as DC/DA from now on) in the RTL format. DC/DA provides the
following two methods of design entry:
    if A = '1' then
        E <= B + C;
    else
        E <= B + D;
    end if;

The above code would generate two ALUs, one for the addition B + C and the other for the
addition B + D, which are executed under mutually exclusive conditions. Therefore a
single ALU can be shared for both additions. The hardware synthesized for the above
code is given below in Figure 3 (a).

[Figure 3 (a). Without resource allocation: inputs B, C and B, D feed two ALUs; a MUX
selected by A produces E.]

[Figure 3 (b). With resource allocation: a single shared ALU produces E.]

It is clear from the figure that one ALU has been removed, with the remaining ALU being
shared by both addition operations. However, a multiplexer is introduced at the inputs of
the ALU, and it contributes to the path delay. Earlier the timing path of the select signal
went through the multiplexer alone, but after resource sharing it goes through the
multiplexer and the ALU datapath, increasing its path delay. On the other hand, resource
sharing has decreased the area of the design. This is therefore a trade-off that the designer
may have to make. If the design is timing-critical it would be better if no resource sharing
is performed.

    B := R1 + R2;
    …..
    C <= R3 - (R1 + R2);

Here the subexpression R1 + R2 in the signal assignment for C can be replaced by
B as given below. This might generate only one adder for the computation instead of two.

    C <= R3 - B;
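The subexpression reuse described above can be sketched as a complete, compilable unit. This is only an illustration under stated assumptions: the entity name subexpr_share, the port names, the 8-bit width, and the use of ieee.numeric_std are all hypothetical choices that do not come from the original text.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names and the 8-bit width are assumptions for illustration.
entity subexpr_share is
  port (
    r1, r2, r3 : in  unsigned(7 downto 0);
    b_out      : out unsigned(7 downto 0);
    c_out      : out unsigned(7 downto 0)
  );
end entity subexpr_share;

architecture rtl of subexpr_share is
begin
  process (r1, r2, r3)
    variable b : unsigned(7 downto 0);
  begin
    b := r1 + r2;      -- the common subexpression is written once
    b_out <= b;
    c_out <= r3 - b;   -- reuses b instead of repeating (r1 + r2)
  end process;
end architecture rtl;
```

Written this way the source contains a single addition, so the tool is not left to discover that the two occurrences of R1 + R2 are identical. Whether exactly one adder is produced still depends on the tool and its constraints.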
Common factoring is the extraction of common subexpressions in mutually-exclusive
branches of an if or case statement.

    if (test) then
        A <= B & (C + D);
    else
        J <= (C + D) | T;
    end if;

In the above code the common factor C + D can be placed outside the if statement, which
might result in the tool generating only one adder instead of two as in the above case.

    temp := C + D;   -- A temporary variable is introduced.
    if (test) then
        A <= B & temp;
    else
        J <= temp | T;
    end if;

Such minor changes, if made by the designer, can cause the tool to synthesize
better logic and also enable it to concentrate on optimizing more critical areas.

Moving Code
In certain cases an expression whose value does not change from one iteration to the
next might be placed within a for/while loop statement. Typically a synthesis
tool handles a for/while loop statement by unrolling it the specified number of times.
In such cases redundant code might be generated for that particular expression, causing
additional logic to be synthesized. This can be avoided if the expression is moved
outside the loop, thus optimizing the design. Such optimizations performed at a higher
level, that is, within the model, help the optimizer to concentrate on more critical
pieces of the code. An example is given below.

    C := A + B;
    …………
    for i in 0 to 5 loop
        ……………
        T := C - 6;
        -- Assumption: C is not assigned a new value within the loop, thus the above
        -- expression would remain constant on every iteration of the loop.
        ……………
    end loop;

The above code would generate six subtracters for the expression when only one
is necessary. By modifying the code as given below we can avoid the generation
of unnecessary logic.

    C := A + B;
    …………
    temp := C - 6;   -- A temporary variable is introduced.
    for i in 0 to 5 loop
        ……………
        T := temp;
        -- Assumption: C is not assigned a new value within the loop, thus the above
        -- expression would remain constant on every iteration of the loop.
        ……………
    end loop;

Constant folding and Dead code elimination
There are cases where the designer might leave in certain expressions which are
constant in value. This can be avoided by computing the expressions in advance instead
of implementing the logic and then relying on the logic optimizer to eliminate the
additional logic.
Ex:
    C := 4;
    ….
    Y := 2 * C;
Computing the value of Y as 8 and assigning it directly within your code avoids the
above unnecessary code. This method is called constant folding.

The other optimization, dead code elimination, refers to removing those sections of
code which are never executed.

Ex.
    A := 2;
    B := 4;
    if (A > B) then
        ……
    end if;

The body of the above if statement would never be executed, since A > B is always false,
and thus it should be eliminated from the code.
The logic optimizer performs these optimizations by itself, but if the
designer optimizes the code accordingly the tool has less optimization work to do,
resulting in shorter tool run times.

Flip-flop and Latch optimizations
Earlier, in the RTL code section, it was described how flip-flops and latches
are inferred from the code by the synthesis tool. However, there are only certain cases
where the inference of these two elements is necessary. The designer should thus try
to eliminate all the unnecessary flip-flop and latch elements in the design. Placing only
the clock-sensitive signals under the edge-sensitive statement can eliminate the
unnecessary flip-flops. Similarly, unwanted latches can be avoided by specifying the
values for the signals under all conditions of an if/case statement.

Using Parentheses.
The usage of parentheses is critical to the design, as correct usage might result
in better timing paths.
Ex.
    Result <= R1 + R2 - P + M;

The hardware generated for the above code is as given below in Figure 4 (a).

[Figure 4 (a). Without using parentheses: three ALUs in series. ALU(+) adds R1 and R2,
ALU(-) subtracts P, and ALU(+) adds M to produce Result.]

If the expression is written using parentheses as given below, the hardware
synthesized would be as given in Figure 4 (b).

    Result <= (R1 + R2) - (P - M);

[Figure 4 (b): ALU(+) on R1, R2 and ALU(-) on P, M operate in parallel and feed a final
ALU(-).]

It is clear that after using the parentheses the timing path for the datapath has been
reduced, as it does not need to go through one more ALU as in the earlier case.

Partitioning and structuring the design.
A design should always be structured and partitioned, as this helps in reducing
design complexity and also improves synthesis run times, since smaller sub-blocks
synthesize faster. Good partitioning results in the synthesis of a good quality
design. General recommendations for partitioning are given below.

• Keep related combinational logic in the same module.
• Partition for design reuse.
• Separate modules according to their functionality.
• Separate structural logic from random logic.
• Limit blocks to a reasonable size (perhaps a maximum of 10K gates per block).
• Partition the top level.
• Do not add glue-logic at the top level.
• Isolate state machines from other logic.
• Avoid multiple clocks within a block.
• Isolate the block that is used for synchronizing the multiple clocks.

Optimization using Design Compiler/Design Analyzer
Optimizing a design to achieve minimum area and maximum speed requires a
lot of experimentation and iterative synthesis. The process of analyzing the
design for speed and area to achieve the fastest logic with minimum area is termed
design space exploration.
Changing the HDL code for the sake of optimization may impact other blocks in the
design or test benches. For this reason, changing the HDL code to help synthesis is less
desirable and is generally avoided. It is then the designer's responsibility to minimize the
area and meet the timing requirements through synthesis and optimization. The later
versions of DC, starting from DC98, have a compile flow that differs from previous
versions. In DC98 and later versions timing is prioritized over area. Another
difference is that DC98 performs compilation to reduce "total negative slack" instead of
"worst negative slack". This ability of DC98 produces better timing results but has some
impact on area. Also, DC98 requires designers to specify area constraints explicitly, as
opposed to the previous versions, which handled area minimization automatically. Generally
some area cleanup is performed, but better results are obtained when constraints are
specified.

DC has three different compilation strategies. It is up to the user's discretion to
choose the most suitable compilation strategy for a design.
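To make the difference between "worst negative slack" and "total negative slack" mentioned above concrete, here is a small worked illustration. The definitions are the usual static timing ones (slack is required time minus arrival time at an endpoint); the three slack values are made-up numbers, not results from any design.

```latex
\begin{align*}
\text{slack}_i &= T_{\text{required},i} - T_{\text{arrival},i} \\
\text{WNS} &= \min_i \text{slack}_i, \qquad
\text{TNS} = \sum_{i:\,\text{slack}_i < 0} \text{slack}_i \\[4pt]
\text{slack} &= \{-0.5,\ -0.2,\ +0.3\}\ \text{ns}
  \;\Rightarrow\; \text{WNS} = -0.5\ \text{ns}, \quad \text{TNS} = -0.7\ \text{ns}
\end{align*}
```

In this example, fixing the second endpoint (from -0.2 ns to 0 ns) leaves WNS unchanged at -0.5 ns but improves TNS to -0.5 ns, which is why optimizing for total negative slack gives the tool a reason to work on every violating path and not only the worst one.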
[Figure 5(a). Timing diagram for setup time on DATA (setup time > 2 ns).]

[Figure 5(b). Timing diagram for hold time on DATA (hold time > 1 ns).]

The synthesis tool automatically runs its internal static timing analysis engine to check
for setup and hold time violations on the paths that have timing constraints set on them.
It mostly uses the following two equations to check for violations.

    Tprop + Tdelay < Tclock - Tsetup    (1)
    Tdelay + Tprop > Thold              (2)

Here Tprop is the propagation delay from input clock to output of the device in question
(mostly a flip-flop); Tdelay is the propagation delay across the combinational logic
through which the input arrives; Tsetup is the setup time requirement of the device;
Tclock is the clock period; Thold is the hold time requirement of the device.
So if the propagation delay across the combinational logic, Tdelay, is such that
equation (1) fails, i.e. Tprop + Tdelay is more than Tclock - Tsetup, then a setup timing
violation is reported. Similarly, if equation (2) fails, i.e. Tdelay + Tprop is less than
Thold, then a hold timing violation is reported. In the case of the setup violation the
input data arrives late due to

3) Architectural changes: This is the last option, as the designer needs to change the
whole architecture of the design under consideration, which would take a long time.

Optimization using synthesis tool
The tool can be used to tweak the design to improve performance. A designer
can employ the following methods for performance optimization.

a) Compilation with a map_effort high option;
b) Group critical paths together and give them a weight factor;
c) Register balancing;
d) Choose a specific implementation for a module;
e) Balancing heavy loading.

Compilation with a map_effort high
The initial compilation of a design is done with map_effort set to medium when
employing design constraints. This usually gives the best results with flattening and
structuring options. If the desired results are not met, i.e. the design generates some
timing violations, then map_effort can be set to high. This usually takes a long time
to run and thus is not used as the first option. This compilation could improve design
performance by about 10%.

Group critical paths and assign a weight factor
We can use the group_path command to group critical timing paths and set a
weight factor on these critical paths. The weight factor indicates the effort the tool needs
to spend to optimize these paths. The larger the weight factor, the greater the effort. This
command allows the designer to prioritize the critical paths for optimization using the
weight factor.

    group_path -name <group_name> -from <starting_point> -to <ending_point> -weight <value>

Register balancing
This command is particularly useful with designs that are pipelined. The
command reshuffles the logic from one pipeline stage to another. This allows extra logic
to be moved away from overly constrained pipeline stages to less constrained ones with
additional timing. The command is simply balance_registers.

Choose a specific implementation for a module
A synthesis tool infers high-level functional modules for operators like '+', '-',
'*', etc. However, depending upon the map_effort option set, Design Compiler
chooses the implementation for the functional module. For example, the adder has the
following kinds of implementation.

a) Ripple carry - rpl
b) Carry look ahead - cla
c) Fast carry look ahead - clf
d) Simulation model - sim

The implementation type sim is only for simulation. Implementation types rpl, cla,
and clf are for synthesis; clf is the fastest implementation, followed by cla, with rpl
being the slowest.
If map_effort is set to low, the designer can manually set the
implementation using the set_implementation command. Otherwise the selection will
not change from the current choice. If map_effort is set to medium, Design Compiler
automatically chooses the appropriate implementation depending upon the
optimization algorithm. A choice of medium map_effort is suitable for better optimization,
or a manual setting can be used for better performance results.

Microarchitectural Tweaks
The design can be modified to fix both setup timing violations and hold timing
violations. Let us deal with setup timing violations first.
When a design with setup violations cannot be fixed with tool optimizations, changes
to the code or the microarchitectural implementation should be employed.
The following methods can be used for this purpose.

a) Logic duplication to generate independent paths
b) Balancing of logic between flip-flops
c) Priority decoding versus multiplex decoding

Logic duplication to generate independent paths
Consider Figure 5(a). Assuming a critical path exists from A to Q2, logic optimization
on combinational logic X, Y, and Z would be difficult because X is shared with Y and Z.
We can duplicate the logic X as shown in Figure 5(b). In this case Q1 and Q2 have
independent paths, and the path for Q2 can be optimized in a better fashion by the tool to
ensure better performance.

[Figure 5(a). Logic with Q2 critical path: inputs A, B, C; logic blocks X, Y, Z; outputs
Q1, Q2.]

[Figure 5(b): logic X duplicated; A and B feed a combined X+Y block that drives Q1.]

Logic duplication can also be used in cases where a module has one signal
arriving late compared to other signals. The logic can be duplicated in front of the
fast-arriving signals such that the timing of all the signals is balanced. Figures 6(a) and
6(b) illustrate this. The signal Q might generate a setup violation, as it might be delayed
by the late-arriving select signal of the multiplexer. The combinational logic present
at the output could be put in front of the (fast-arriving) inputs. This would cause the delay
due to the combinational logic to be used to balance the timing of the inputs
of the multiplexer, thus avoiding the setup violation for Q.

[Figure 6(a). Multiplexer with late-arriving sel signal.]

[Figure 6(b). Logic duplication for balancing of timing between the signals.]

Balancing of logic between flip-flops
This concept is similar to the balance_registers command we came across
in the tool optimization section. The difference is that the designer does this at the
code level. To fix setup violations in designs using pipeline stages, the logic between
stages should be balanced. Consider a pipeline consisting of three flip-flops with a
combinational logic module between each pair of flip-flops. Suppose the delay of the first
logic module is such that it violates the setup time of the second flip-flop by a large
margin, while the delay of the second logic module is so small that the data on the third
flip-flop comfortably meets the setup requirement. We can move part of the first logic
module to the second logic module so that the setup time requirement of both flip-flops
is met. This would ensure better performance without any violations taking place. Figure 7
illustrates the example.

[Figure 7: three flip-flops (D, Q) on a common clock, with combinational logic X and Y
between them.]

Priority encoding versus multiplex encoding
When a designer knows for sure that a particular input signal arrives late, then
priority encoding is a good bet. The signals arriving earlier can be given more
priority and thus can be encoded before the late-arriving signals.
Consider the boolean equation:

    Q = A.B.C.D.E.F

It can be designed using five AND gates with A and B at the first gate. The output of the
first gate is ANDed with C, the output of the second gate with D, and so on. This would
ensure proper performance if signal F is the latest to arrive and A is the earliest.
If the propagation delay of each AND gate were 1 ns, the output signal Q
would be valid only 5 ns after A is valid, or only 1 ns after signal F is valid.
Multiplex decoding is useful if all the input signals arrive at the same time. This
ensures that the output becomes valid sooner. Thus multiplex decoding is
faster than priority decoding if all input signals arrive at the same time. In this case, for
the boolean equation above, the inputs would be ANDed in pairs in parallel in the
form of A.B, C.D and E.F, and these outputs would then be ANDed again to get the final
output. This would ensure Q is valid about 2 ns after A is valid (assuming the three
intermediate outputs are combined in one further gate level).

Fixing Hold time violations
Hold time violations occur when signals arrive too fast, causing them to change
before they are read in by the devices. The best method to fix paths with hold time
violations is to add buffers in those paths. The buffers generate additional delay, slowing
the path considerably. One has to be careful while fixing hold time violations. Too many
buffers would slow down the signal a lot and might result in setup violations, which is a
problem again.
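As a worked illustration of this balance, the setup check Tprop + Tdelay < Tclock - Tsetup and the hold check Tdelay + Tprop > Thold can be applied to a path with added buffer delay. The numbers here are assumptions chosen only for the arithmetic: Tprop = 0.5 ns, Tdelay = 0.3 ns and Tclock = 10 ns are invented, Tsetup = 2 ns and Thold = 1 ns echo the values in the timing diagrams, and Tbuf is a hypothetical symbol for the total delay of the inserted buffers.

```latex
\begin{align*}
T_{prop} + T_{delay} &= 0.5 + 0.3 = 0.8\ \text{ns} \;<\; T_{hold} = 1\ \text{ns}
  && \text{(hold check fails)} \\
T_{prop} + T_{delay} + T_{buf} &> T_{hold}
  \;\Rightarrow\; T_{buf} > 0.2\ \text{ns}
  && \text{(hold fixed)} \\
T_{prop} + T_{delay} + T_{buf} &< T_{clock} - T_{setup} = 10 - 2 = 8\ \text{ns}
  \;\Rightarrow\; T_{buf} < 7.2\ \text{ns}
  && \text{(setup still met)}
\end{align*}
```

Under these assumed values any buffer delay between 0.2 ns and 7.2 ns fixes the hold violation without creating a setup violation. On a path with less setup margin the window narrows, which is the risk the text warns about.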