Superscalar Processors Questions
Review
Dependence graph
• Nodes: instructions
• Edges: ordered relations among the instructions
• Any ordering-based transformation that does not change
the dependencies of the program is guaranteed not
to change the result of the program.
• Example
• S1: Load R1, A / R1 ← Memory (A) /
• S2: Add R2, R1 / R2 ← R2 + R1 /
S1 → S2
Data dependency
• Example
S1: Load R1, A / R1 ← Memory (A) /
S2: Add R2, R1 / R2 ← R2 + R1 /
S1 → S2
S3: Move R1, R3 / R1 ← R3 /
S4: Store B, R1 / Memory (B) ← R3 /
S3 → S4
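The edges in the example above can be derived mechanically from each instruction's destination and source registers. The following is a minimal sketch, not taken from the slides: the `Instr` record and the `dependences` function are hypothetical names, and the labels RAW (read after write, the data dependency shown above), WAR (write after read) and WAW (write after write) are the usual names for the three edge kinds.

```python
from collections import namedtuple

# Hypothetical record: instruction name, destination register (or None), source registers.
Instr = namedtuple("Instr", ["name", "dest", "srcs"])

def dependences(program):
    """Return (earlier, later, kind) edges of the dependence graph, in program order."""
    edges = []
    last_writer = {}  # register -> name of the most recent instruction that wrote it
    readers = {}      # register -> instructions that read it since that write
    for ins in program:
        for r in ins.srcs:
            if r in last_writer:                       # reads a value produced earlier
                edges.append((last_writer[r], ins.name, "RAW"))
        if ins.dest is not None:
            for reader in readers.get(ins.dest, []):   # overwrites a value still being read
                edges.append((reader, ins.name, "WAR"))
            if ins.dest in last_writer:                # overwrites an earlier result
                edges.append((last_writer[ins.dest], ins.name, "WAW"))
        for r in ins.srcs:
            readers.setdefault(r, []).append(ins.name)
        if ins.dest is not None:
            last_writer[ins.dest] = ins.name
            readers[ins.dest] = []
    return edges

program = [
    Instr("S1", "R1", ()),            # Load  R1, A
    Instr("S2", "R2", ("R2", "R1")),  # Add   R2, R1
    Instr("S3", "R1", ("R3",)),       # Move  R1, R3
    Instr("S4", None, ("R1",)),       # Store B, R1
]

for earlier, later, kind in dependences(program):
    print(f"{earlier} -> {later} ({kind})")
```

Besides the two data dependencies S1 → S2 and S3 → S4, the sketch also reports that S3 overwrites R1 after S2 has read it (WAR) and after S1 has written it (WAW).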
EXAMPLE
• How long would the following sequence of instructions
take to execute on a superscalar processor with two
execution units, each of which can execute any
instruction? Load operations have a latency of two
cycles, and all other operations have a latency of one
cycle. Assume that the pipeline depth is 5 stages.
LD r1, (r2)
ADD r3, r1, r4
SUB r5, r6, r7
MUL r8, r9, r10
Example (cont.)
• In-order execution
• There are five pipeline stages and load has latency of 2
clock cycles
• Fetch, Decode, Execution, Memory access and Write
back are the pipeline stages
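• ADD must wait two cycles for the LD result, and SUB and MUL
cannot issue ahead of it, so the four instructions take 4 cycles
to issue (LD; stall; ADD and SUB; MUL)
• Total number of cycles is 5 + 4 - 1 = 8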
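The in-order count can be checked with a small issue-time model. This is a sketch under stated assumptions rather than the slides' own method: total time is taken as pipeline depth plus the distance between the first and last issue cycle, only true (read after write) dependencies delay an instruction, and the function name `in_order_cycles` and the tuple layout `(dest, srcs, latency)` are hypothetical.

```python
def in_order_cycles(program, width=2, depth=5):
    """Total cycles, assuming the first instruction issues in cycle 0."""
    ready = {}  # register -> first cycle in which its new value can be used
    used = {}   # issue cycle -> number of instructions already issued in it
    prev = 0    # issue cycle of the previous instruction
    for dest, srcs, latency in program:
        # Wait for the operands, and never issue before an older instruction.
        cycle = max([prev] + [ready.get(r, 0) for r in srcs])
        while used.get(cycle, 0) >= width:  # all execution units taken
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        ready[dest] = cycle + latency
        prev = cycle
    return depth + prev

program = [
    ("r1", ("r2",), 2),       # LD  r1, (r2)
    ("r3", ("r1", "r4"), 1),  # ADD r3, r1, r4
    ("r5", ("r6", "r7"), 1),  # SUB r5, r6, r7
    ("r8", ("r9", "r10"), 1), # MUL r8, r9, r10
]

print(in_order_cycles(program))  # 8
```

In this model LD issues in cycle 0, ADD and SUB in cycle 2, and MUL in cycle 3, which gives 5 + 3 = 8 cycles.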
Example (cont.)
• Out-of-order execution
• There are five pipeline stages and load has latency of 2
clock cycles
• Fetch, Decode, Execution, Memory access and Write
back are the pipeline stages
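• SUB can issue together with the LD and MUL in the next cycle
while ADD waits for the LD result, so the four instructions take
3 cycles to issue
• Total number of cycles is 5 + 3 - 1 = 7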
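The out-of-order count can be checked the same way by dropping the rule that an instruction may not issue before an older one. This is again a sketch under assumptions: total time is pipeline depth plus the last issue cycle (the first issue is cycle 0), older instructions get first pick of the execution units, and only true (read after write) dependencies are modeled, which is enough here because this sequence has no WAR or WAW dependencies. The name `out_of_order_cycles` and the tuple layout are hypothetical.

```python
def out_of_order_cycles(program, width=2, depth=5):
    """Total cycles, assuming the first instruction issues in cycle 0."""
    ready = {}  # register -> first cycle in which its new value can be used
    used = {}   # issue cycle -> number of instructions already issued in it
    last = 0    # latest issue cycle seen so far
    for dest, srcs, latency in program:  # older instructions pick a slot first
        # Issue as soon as the operands are available, regardless of program order.
        cycle = max([0] + [ready.get(r, 0) for r in srcs])
        while used.get(cycle, 0) >= width:  # all execution units taken
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        ready[dest] = cycle + latency
        last = max(last, cycle)
    return depth + last

program = [
    ("r1", ("r2",), 2),       # LD  r1, (r2)
    ("r3", ("r1", "r4"), 1),  # ADD r3, r1, r4
    ("r5", ("r6", "r7"), 1),  # SUB r5, r6, r7
    ("r8", ("r9", "r10"), 1), # MUL r8, r9, r10
]

print(out_of_order_cycles(program))  # 7
```

Here LD and SUB issue in cycle 0, MUL in cycle 1, and ADD in cycle 2, so the sequence finishes in 5 + 2 = 7 cycles.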
• Solutions
Register renaming
• On an out-of-order superscalar processor with 8 execution units,
what is the execution time of the following sequence with and
without register renaming if any execution unit can execute any
instruction and the latency of all instructions is one cycle? Assume
that the hardware register file contains enough registers to remap
each destination register to a different hardware register and that
the pipeline depth is 5 stages.
• LD r7, (r8)
• MUL r1, r7, r2
• SUB r7, r4, r5
• ADD r9, r7, r8
• LD r8, (r12)
• DIV r10, r8, r10
Solution
• In this example, WAR dependencies are a significant limitation on parallelism, forcing
the DIV to issue 3 cycles after the first LD, for a total execution time of 8 cycles (the
MUL and the SUB can execute in parallel, as can the ADD and the second LD). After
register renaming, the program becomes
• LD hw7, (hw8)
• MUL hw1, hw7, hw2
• SUB hw17, hw4, hw5
• ADD hw9, hw17, hw8
• LD hw18, (hw12)
• DIV hw10, hw18, hw10
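The renaming step and its effect on issue time can be sketched as follows. Everything here is an illustrative assumption rather than the slides' method: `rename` gives a destination a fresh hardware register (numbered from hw17, to match the listing above) whenever an earlier instruction has already read or written it, and `total_cycles` takes total time as pipeline depth plus the last issue cycle. In `total_cycles`, a RAW-dependent instruction issues at least one latency after its producer, a WAR-dependent instruction may issue in the same cycle as the earlier reader, and a WAW-dependent instruction issues at least one cycle after the earlier writer.

```python
def rename(program, first_free=17):
    """Map architectural registers rN to hardware registers hwN, renaming reused destinations."""
    mapping = {}     # architectural register -> current hardware register
    touched = set()  # architectural registers read or written by earlier instructions
    next_free = first_free
    renamed = []
    for op, dest, srcs in program:
        hw_srcs = tuple(mapping.get(r, "hw" + r[1:]) for r in srcs)
        if dest in touched:              # reused by an earlier instruction: allocate a fresh one
            mapping[dest] = "hw%d" % next_free
            next_free += 1
        else:
            mapping[dest] = "hw" + dest[1:]
        touched.update(srcs)
        touched.add(dest)
        renamed.append((op, mapping[dest], hw_srcs))
    return renamed

def total_cycles(program, width=8, depth=5, latency=1):
    issue = []  # issue cycle of each instruction, in program order
    used = {}   # issue cycle -> number of instructions already issued in it
    for i, (op, dest, srcs) in enumerate(program):
        cycle = 0
        for j in range(i):
            _, dest_j, srcs_j = program[j]
            if dest_j in srcs:           # RAW: wait for the producer's result
                cycle = max(cycle, issue[j] + latency)
            if dest in srcs_j:           # WAR: not before the earlier reader
                cycle = max(cycle, issue[j])
            if dest == dest_j:           # WAW: after the earlier writer
                cycle = max(cycle, issue[j] + 1)
        while used.get(cycle, 0) >= width:
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        issue.append(cycle)
    return depth + max(issue)

program = [
    ("LD",  "r7",  ("r8",)),
    ("MUL", "r1",  ("r7", "r2")),
    ("SUB", "r7",  ("r4", "r5")),
    ("ADD", "r9",  ("r7", "r8")),
    ("LD",  "r8",  ("r12",)),
    ("DIV", "r10", ("r8", "r10")),
]

print(total_cycles(program))          # 8 without renaming
print(total_cycles(rename(program)))  # 6 with renaming
```

Under these assumptions the renamed program is left with only the three RAW dependencies (LD → MUL, SUB → ADD, LD → DIV), so both LDs and the SUB issue in the first cycle and the MUL, ADD and DIV in the second.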