Final Exam - Fall 2008: COE 308 - Computer Architecture
Student ID:
Q1 / 20
Q2 / 15
Q3 / 20
Q4 / 20
Q5 / 25
Total / 100
Q1. (20 points) True or False? Explain or give the right answer for a full mark.
a) On a read, the value returned by the cache depends on which blocks are in the cache.
False, it depends on the last value written to the same memory location
b) False, most of the cost is at the L2 cache (it takes more area on the chip)
c) The higher the memory bandwidth, the larger the cache block size should be.
True, if the memory bandwidth is high, then a larger block can be transferred in the
same amount of time
True, increasing the capacity eliminates more cache misses than higher associativity
f) Allowing ALU and branch instructions to take fewer stages and complete earlier than
other instructions does not improve the performance of a pipeline.
False, not always: at some point, pipeline register delays become significant, and
more bubbles or stall cycles must be introduced as the pipeline depth is increased
h) The single-cycle datapath must have separate instruction and data memories because
the format of instructions and data is different.
The stated reason is false: separate memories are needed because both must be accessed
during the same cycle, and each memory is single-ported.
i) A given application runs in 15 seconds. A new compiler is released that requires only
0.6 as many instructions as the old compiler. Unfortunately, it increases the CPI by
1.1. We expect the application to run using this new compiler in 15×0.6/1.1 = 8.18 sec
False, execution time is proportional to (instruction count × CPI), so the new run
time is 15 × 0.6 × 1.1 = 9.9 sec
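As a cross-check, the same calculation in Python (assuming the clock rate is unchanged):

# Execution time scales with (instruction count x CPI) at a fixed clock rate,
# so the expected run time with the new compiler is:
old_time = 15.0                    # seconds
new_time = old_time * 0.6 * 1.1    # 0.6x the instructions, 1.1x the CPI
print(new_time)                    # 9.9 seconds, not 15 * 0.6 / 1.1 = 8.18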
j) If Computer A has a higher MIPS rating than computer B, then A is faster than B.
False, it is possible to have a higher MIPS rating and a worse execution time.
Q2. (15 pts) Consider a direct-mapped cache with 128 blocks. The block size is 32 bytes.
a) (3 pts) Find the number of tag bits, index bits, and offset bits in a 32-bit address.
Offset bits = log2(32) = 5
Index bits = log2(128) = 7
Tag bits = 32 - 7 - 5 = 20
b) (4 pts) Find the number of bits required to store all the valid and tag bits in the cache.
Total number of tag and valid bits = 128 * (20 + 1) = 2688 bits
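For verification, a short Python sketch of parts (a) and (b), using the cache parameters given above:

from math import log2

address_bits = 32
num_blocks = 128     # direct-mapped, so one block per set
block_size = 32      # bytes per block

offset_bits = int(log2(block_size))                    # 5
index_bits = int(log2(num_blocks))                     # 7
tag_bits = address_bits - index_bits - offset_bits     # 20

# one valid bit plus one tag stored per block
tag_valid_bits = num_blocks * (tag_bits + 1)           # 2688
print(offset_bits, index_bits, tag_bits, tag_valid_bits)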
Starting with an empty cache, show the index and tag for each address and indicate
whether a hit or a miss.
b) (3 pts) What is the average memory access time for data access in clock cycles?
AMAT = 1 + 0.05 * 100 = 6 clock cycles
c) (4 pts) What is the number of stall cycles per instruction and the overall CPI?
Stall cycles per instruction = 1 * 0.02 * 100 + 0.3 * 0.05 * 100 = 3.5 cycles
Overall CPI = 1.2 + 3.5 = 4.7 cycles per instruction
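The same arithmetic in Python; the miss rates, load/store fraction, miss penalty, and base CPI below are the figures implied by the worked solution (the question statement giving them is not reproduced here):

hit_time = 1          # cycles for an L1 hit
i_miss_rate = 0.02    # I-cache miss rate
d_miss_rate = 0.05    # D-cache miss rate
ls_fraction = 0.30    # fraction of instructions that are loads/stores
miss_penalty = 100    # cycles to main memory
base_cpi = 1.2

amat_data = hit_time + d_miss_rate * miss_penalty                      # 6 cycles
stalls = (1 * i_miss_rate + ls_fraction * d_miss_rate) * miss_penalty  # 3.5
overall_cpi = base_cpi + stalls                                        # 4.7
print(amat_data, stalls, overall_cpi)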
Suppose we now add an L2 cache that has a hit time of 5 ns, which is the time to access
and transfer a block between the L2 and the L1 cache. Of all the memory references sent
to the L2 cache, 80% are satisfied without going to main memory.
d) (4 pts) What is the average memory access time for instruction access in clock cycles?
Hit time in the L2 cache = 5 ns * 2 GHz = 10 clock cycles
AMAT = 1 + 0.02 * (10 + 0.2 * 100) = 1.6 clock cycles
e) (4 pts) What is the number of stall cycles per instruction and the overall CPI?
Stall cycles per instruction = (1 * 0.02 + 0.3 * 0.05) * (10 + 0.2 * 100) = 1.05 cycles
Overall CPI = 1.2 + 1.05 = 2.25 cycles per instruction
f) (2 pts) How much faster will the machine be after adding the L2 cache?
Speedup = old CPI / new CPI = 4.7 / 2.25 ≈ 2.09
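Parts (d) through (f) can be checked the same way, using the figures above plus the L2 parameters just given:

l2_hit_cycles = 5 * 2            # 5 ns at 2 GHz = 10 cycles
l2_local_hit = 0.80              # fraction of L1 misses satisfied by the L2

# Penalty seen by an L1 miss: L2 access plus main memory for L2 misses
l1_miss_penalty = l2_hit_cycles + (1 - l2_local_hit) * 100   # 30 cycles

amat_inst = 1 + 0.02 * l1_miss_penalty                       # 1.6 cycles
stalls = (1 * 0.02 + 0.30 * 0.05) * l1_miss_penalty          # 1.05
overall_cpi = 1.2 + stalls                                   # 2.25
speedup = 4.7 / overall_cpi                                  # about 2.09
print(amat_inst, stalls, overall_cpi, speedup)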
[Figure: program layout, instruction addresses running from Start = 0 to End = 999, with markers at addresses 64 and 575]
a) (6 pts) What is the total instruction count and how many I-cache misses are caused by
the program?
0 → 63: 64 instructions = 16 I-blocks => 16 I-cache misses
64 → 127: 64 instructions = 16 I-blocks => 16 I-cache misses
128 → 255: 128 instructions = 32 I-blocks => 32 I-cache misses across all 20 inner-loop iterations
The inner loop fits inside the I-cache, so only its first pass causes I-cache misses
256 → 575: 320 instructions = 80 I-blocks => 80 I-cache misses
Outer loop I-cache misses = 10 * (16 + 32 + 80) = 1280
576 → 999: 424 instructions = 106 I-blocks => 106 I-cache misses
Total I-cache misses = 16 + 1280 + 106 = 1402
Total instruction count = 64 + 10 × (64 + 20 × 128 + 320) + 424 = 29,928 instructions
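A Python sketch of part (a); the segment sizes, iteration counts, 4-instruction block size, and 64-block I-cache capacity are the values implied by the solution above:

from math import ceil

instrs_per_block = 4
outer_iters, inner_iters = 10, 20

# Instruction count of each code segment (address ranges shown above)
pre_loop, outer_head, inner_loop, outer_tail, post_loop = 64, 64, 128, 320, 424

def blocks(n):
    return ceil(n / instrs_per_block)

instr_count = (pre_loop
               + outer_iters * (outer_head + inner_iters * inner_loop + outer_tail)
               + post_loop)                       # 29,928

# The outer-loop body spans 16 + 32 + 80 = 128 blocks, which exceeds the
# 64-block I-cache, so its blocks miss again on every outer iteration.
# The inner loop fits, so it misses only on its first pass each time.
misses = (blocks(pre_loop)
          + outer_iters * (blocks(outer_head) + blocks(inner_loop) + blocks(outer_tail))
          + blocks(post_loop))                    # 1402
print(instr_count, misses)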
c) (4 pts) If only cache misses stall the processor, what is the execution time (in
nanoseconds) of the above program on a 2 GHz pipelined processor?
Total clock cycles = 29,928 + 1402 * 20 = 57,968 cycles
Execution time = 57,968 * 0.5 ns = 28,984 ns
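The execution-time step, assuming (as the solution above does) a 20-cycle miss penalty, a CPI of 1 apart from miss stalls, and a 0.5 ns cycle at 2 GHz:

base_cycles = 29928        # one cycle per instruction when nothing stalls
miss_penalty = 20          # cycles per I-cache miss, as used above
cycle_time_ns = 0.5        # 2 GHz clock

total_cycles = base_cycles + 1402 * miss_penalty   # 57,968
exec_time_ns = total_cycles * cycle_time_ns        # 28,984 ns
print(total_cycles, exec_time_ns)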
d) (8 pts) Repeat (a) thru (c) if a bigger block size that can store 8 instructions is used. The
total number of blocks in the I-cache is still 64. What is the speedup factor?
0 → 63: 64 instructions = 8 I-blocks => 8 I-cache misses
64 → 127: 64 instructions = 8 I-blocks => 8 I-cache misses
128 → 255: 128 instructions = 16 I-blocks => 16 I-cache misses across all 20 inner-loop iterations
The inner loop fits inside the I-cache, so only its first pass causes I-cache misses
256 → 575: 320 instructions = 40 I-blocks => 40 I-cache misses
Outer loop I-cache misses = 8 + 16 + 40 = 64
Outer loop can fit inside I-cache. Only first pass causes I-cache misses
576 → 999: 424 instructions = 53 I-blocks => 53 I-cache misses
Total I-cache misses = 8 + 64 + 53 = 125
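A sketch of part (d). The speedup printed at the end assumes the miss penalty stays at 20 cycles, which is only an assumption here, since the penalty for the larger block is not stated in this excerpt:

from math import ceil

def blocks8(n):
    return ceil(n / 8)     # 8 instructions per block

# The outer-loop body is now 8 + 16 + 40 = 64 blocks, which just fits in the
# 64-block I-cache, so only its first pass misses.
misses = (blocks8(64)                                   # addresses 0-63
          + blocks8(64) + blocks8(128) + blocks8(320)   # outer loop, first pass only
          + blocks8(424))                               # addresses 576-999
print(misses)                                           # 125

# Assuming the miss penalty stays at 20 cycles (not stated for the larger
# block in this excerpt), the speedup over part (c) would be:
new_cycles = 29928 + misses * 20                        # 32,428
print(57968 / new_cycles)                               # about 1.79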
a) (10 pts) Show the timing of one loop iteration on the 5-stage MIPS pipeline without forwarding hardware. Complete the timing table, showing
all the stall cycles. Assume that the branch will stall the pipeline for 1 clock cycle only.
Cycle numbers 1 through 25 are shown in parentheses; "stall" marks bubble cycles and "delay" the 1-cycle branch delay.
I1: ADDI   IF(1)   ID(2)   EX(3)   M(4)    WB(5)
I2: ADD    IF(2)   ID(3)   EX(4)   M(5)    WB(6)
I3: LW     IF(3)   ID(4)   EX(5)   M(6)    WB(7)
I4: ADD    IF(4)   stall(5-6)      ID(7)   EX(8)   M(9)    WB(10)
I5: LW     stall(5-6)      IF(7)   ID(8)   EX(9)   M(10)   WB(11)
I6: SUB    IF(8)   stall(9-10)     ID(11)  EX(12)  M(13)   WB(14)
I7: ADDI   stall(9-10)     IF(11)  ID(12)  EX(13)  M(14)   WB(15)
I8: ADDI   IF(12)  ID(13)  EX(14)  M(15)   WB(16)
I9: ADDI   IF(13)  ID(14)  EX(15)  M(16)   WB(17)
I10: BNE   IF(14)  stall(15-16)    ID(17)
I3: LW     stall(15-16)    IF(17)  IF(18)  ID(19)  EX(20)  M(21)   WB(22)
I4: ADD    delay(18)       IF(19)  stall(20-21)    ID(22)  EX(23)  M(24)   WB(25)
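A small sketch of the stall rule used in the table: with no forwarding, an instruction may read a register in ID no earlier than the cycle in which the producer writes it in WB. The register names below are hypothetical (the loop's actual operands are not reproduced in this excerpt); only the dependence pattern matters, and the 1-cycle branch delay is not counted by this sketch.

prog = [
    ("I1 ADDI", "t0", []),
    ("I2 ADD",  "t1", []),
    ("I3 LW",   "t2", []),
    ("I4 ADD",  "t3", ["t2"]),   # needs the LW result -> 2 stall cycles
    ("I5 LW",   "t4", []),
    ("I6 SUB",  "t5", ["t4"]),   # needs the LW result -> 2 stall cycles
    ("I7 ADDI", "t6", []),
    ("I8 ADDI", "t7", []),
    ("I9 ADDI", "t8", []),
    ("I10 BNE", None, ["t8"]),   # needs I9's result  -> 2 stall cycles
]

wb_cycle = {}        # register -> cycle in which its new value is written back
prev_id = 0          # ID cycle of the previous instruction
total_stalls = 0
for name, dest, sources in prog:
    earliest_id = prev_id + 1                     # in-order, one decode per cycle
    for reg in sources:
        if reg in wb_cycle:
            # cannot decode before the producer's write-back cycle
            earliest_id = max(earliest_id, wb_cycle[reg])
    total_stalls += earliest_id - (prev_id + 1)
    prev_id = earliest_id
    if dest is not None:
        wb_cycle[dest] = earliest_id + 3          # ID -> EX -> MEM -> WB
print(total_stalls)   # 6 data-hazard stalls per iteration (plus the 1-cycle branch delay)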
b) (5 pts) According to the timing diagram of part (a), compute the number of clock cycles
and the average CPI to execute ALL the iterations of the above loop.
c) (5 pts) Reorder the instructions of the above loop to fill the load-delay and the branch-
delay slots, without changing the computation. Write the code of the modified loop.
d) (5 pts) Compute the number of cycles and the average CPI to execute ALL the iterations
of the modified loop. What is the speedup factor?