Not An Exam Paper


PART-1

Consider the following loop, written in pseudo-C code, which calculates the sum of the entries in a
matrix. Each element of the matrix is a word (a 32-bit integer).

sum = 0;
for (j = 0; j < 4; j++)
    for (i = 0; i < 384; i++)
        sum += A[i][j];

This matrix is stored contiguously in memory in row-major order. Assume that the cache is initially
empty. Also assume that only accesses to the matrix cause memory references and that all other
necessary variables are held in registers. Instructions are in a separate instruction cache.
The data cache is 4 KB in size and is 2-way set associative. Main memory size is 4 MB.
This data is common to all the questions in this section, i.e. Question #2 (Q1.1),
Question #3 (Q1.2) and Question #4 (Q1.3).
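
The block size is the one parameter that changes between the questions below, so the address split can be sketched as a function of it. The following is a minimal sketch, not a model answer: it assumes a byte-addressable memory (so the address width is log2 of the memory size) and power-of-two sizes throughout. The names AddressSplit, log2_exact and split_address are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>

/* Widths of the three address fields, in bits. */
typedef struct {
    int tag_bits;
    int index_bits;
    int offset_bits;
} AddressSplit;

/* log2 of a power of two (asserts that x really is one). */
static int log2_exact(unsigned long x)
{
    assert(x != 0 && (x & (x - 1)) == 0);
    int bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

/* Assumption: memory is byte addressable, so the address has
 * log2(mem_bytes) bits. All sizes are in bytes. */
AddressSplit split_address(unsigned long mem_bytes, unsigned long cache_bytes,
                           unsigned long ways, unsigned long block_bytes)
{
    AddressSplit s;
    unsigned long sets = cache_bytes / (block_bytes * ways);

    s.offset_bits = log2_exact(block_bytes);   /* byte within a block   */
    s.index_bits  = log2_exact(sets);          /* which set             */
    s.tag_bits    = log2_exact(mem_bytes) - s.index_bits - s.offset_bits;
    return s;
}
```

For the cache described above, a call such as split_address(4UL << 20, 4UL << 10, 2, block_bytes) would give the field widths for a chosen block size.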

Q1.1 Assume that the block size (cache line) is 32 bytes. Based on this information, answer the
following questions. Show the calculations. Clearly show the final answer.
a) How many bits would be required for i) Tag ii) Index iii) Block offset? [2 marks]
b) How many cache blocks would be present (in the given cache) in one set of a 2-way set
associative cache? [2 marks]
c) In the above-mentioned cache, which blocks would be eligible to be placed in the top position
of any set in the cache (i.e. index = 000…)? Write down all the eligible blocks with the array
elements (data) contained in them. You may write A[0][0] as A0-0 for convenience. Your answer
should be of the form block 1: [Ax-a, Ax-b, …] etc. Write down all the array elements (data) in the
block. [3 marks]
d) In the same way as explained in c), write down all the blocks which would be eligible to be
placed in the bottom position of any set in the cache (i.e. index = 111…). Write down all the
eligible blocks with the array elements (data) contained in them. Your answer should be of the
form block 1: [Ax-a, Ax-b, …] etc. Write down all the array elements (data) in the block. [3 marks]
e) For the given cache configuration, with the “Least Recently Used” (LRU) replacement policy
employed:
i) What would be the total miss count due to “cold start misses” in the process of bringing in
the entire array? Give a brief explanation. [6 marks]
ii) What would be the total miss count due to “conflict misses” (to bring in the entire array)?
Provide a brief explanation. [6 marks]
f) For the same cache, suppose we change the replacement policy from LRU to “Random” (i.e. in a
2-way set associative cache, of the two blocks having the same index, either block can be thrown
away with equal probability to make space for the new (replacement) block). Based on this
information, answer the following question:
i) What would be the combined miss count (cold start + conflict) to bring in the entire array in the
best-case scenario, where every replacement works to the advantage of lowering the miss
count? (This is very similar to wishing that every coin toss gives you the desired result!)
[6 marks]
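
The LRU miss counts asked about in part e) can be cross-checked with a small simulation of the loop. The sketch below is illustrative only and rests on assumptions that the question does not state: A is taken to be exactly 384 x 4 words and to start at byte address 0 (so it is block aligned), and every miss that is not the first reference to a block is counted as "other", which is what the question calls a conflict miss. The names MissCount and simulate_lru are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ROWS        384
#define COLS        4
#define WORD_BYTES  4
#define CACHE_BYTES 4096
#define WAYS        2
#define MAX_SETS    (CACHE_BYTES / (WAYS * WORD_BYTES)) /* 1-word blocks */
#define MAX_BLOCKS  (ROWS * COLS)                       /* 1-word blocks */

typedef struct {
    long cold;   /* first-ever reference to a block          */
    long other;  /* misses on blocks that were evicted before */
} MissCount;

/* Simulate the column-order loop on a 2-way LRU cache.
 * Assumption: A starts at byte address 0 and is stored row-major. */
MissCount simulate_lru(unsigned long block_bytes)
{
    assert(block_bytes >= WORD_BYTES && block_bytes * WAYS <= CACHE_BYTES);
    assert((block_bytes & (block_bytes - 1)) == 0);

    static long resident[MAX_SETS][WAYS]; /* block number held, -1 = empty */
    static long last_use[MAX_SETS][WAYS]; /* time of most recent access    */
    static bool seen[MAX_BLOCKS];         /* block referenced before?      */
    unsigned long sets = CACHE_BYTES / (block_bytes * WAYS);
    MissCount m = {0, 0};
    long now = 0;

    for (unsigned long s = 0; s < sets; s++)
        for (int w = 0; w < WAYS; w++) {
            resident[s][w] = -1;
            last_use[s][w] = -1;
        }
    memset(seen, 0, sizeof seen);

    for (int j = 0; j < COLS; j++) {
        for (int i = 0; i < ROWS; i++) {
            unsigned long addr  = ((unsigned long)i * COLS + j) * WORD_BYTES;
            unsigned long block = addr / block_bytes;
            unsigned long set   = block % sets;
            int hit_way = -1, victim = 0;

            for (int w = 0; w < WAYS; w++) {
                if (resident[set][w] == (long)block)
                    hit_way = w;
                if (last_use[set][w] < last_use[set][victim])
                    victim = w;           /* empty or least recently used */
            }
            now++;
            if (hit_way >= 0) {
                last_use[set][hit_way] = now;
                continue;
            }
            if (seen[block]) {
                m.other++;
            } else {
                m.cold++;
                seen[block] = true;
            }
            resident[set][victim] = (long)block;
            last_use[set][victim] = now;
        }
    }
    return m;
}
```

Calling simulate_lru(32) models the Q1.1 configuration; passing a different block size lets the same sketch be reused for the later parts. The "Random" policy of part f) is not modelled here.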
Q1.2 Now assume that the block size (cache line) is reduced to 16 bytes, while keeping the other
parameters the same as explained earlier. Based on this information, answer the following questions:
a) How many bits would be required for i) Tag ii) Index iii) Block offset? Show calculations.
[2 marks]
b) How many cache blocks would be present (in the given cache) in one set of a 2-way set
associative cache? [2 marks]
c) In the above-mentioned cache, which blocks would be eligible to be placed in the top position
of any set in the cache (i.e. index = 000…)? Write down all the eligible blocks with the array
elements (data) contained in them. You may write A[0][0] as A0-0 for convenience. Your answer
should be of the form block 1: [Ax-a, Ax-b, …]. Write down all the array elements (data) in the
block. [3 marks]
d) In the same way as explained in c), write down all the blocks which would be eligible to be
placed in the bottom position of any set in the cache (i.e. index = 111…). Write down all the
eligible blocks with the array elements (data) contained in them. You may write A[0][0] as A0-0 for
convenience. Your answer should be of the form block 1: [Ax-a, Ax-b, …]. Write down all the array
elements (data) in the block. [3 marks]
e) For the given cache configuration, with the “Least Recently Used” (LRU) replacement policy
employed:
(i) What would be the total miss count due to “cold start misses” in the process of bringing in
the entire array? Give a brief explanation. [6 marks]
(ii) What would be the total miss count due to “conflict misses” (to bring in the entire array)?
Provide a brief explanation. [6 marks]

Q1.3 Based on the understanding gained from Q1.1 and Q1.2, to achieve the minimum cold
start and conflict misses for this particular loop execution:
i) What would be the best cache block size in words? Explain in brief. [5 marks]
ii) What would be the total misses (conflict + cold start)? Explain in brief. [5 marks]
PART-2

We would like to change the conventional pipeline, which has an F-D-Ex-M-WB
structure. We propose a new pipeline with two ALUs:

ALU1 is in the 3rd pipeline stage (EX1) and ALU2 is in the 4th pipeline stage
(EX2/MEM). A memory instruction always uses ALU1 to compute its address. An
ALU instruction uses either ALU1 or ALU2, but never both. If an ALU
instruction’s operands are available (either from the register file or the
forwarding path) by the end of the ID stage, the instruction uses ALU1;
otherwise, the instruction uses ALU2. Assume that the pipeline is fully bypassed (i.e. it
has the required data forwarding paths and a double-edge-triggered register file).
Assume that the pipeline is initially idle. Registers involved in inter-instruction
dependencies are highlighted in bold for your convenience.
Please use this data for solving Question 2.1 and Question 2.2.
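
The ALU-selection rule above can be sketched as a small in-order timing model. This is one possible reading of the description, not a definitive model: it assumes that an ALU1 result can be forwarded from the end of EX1, that an ALU2 result or loaded value can be forwarded from the end of EX2/MEM, that a memory instruction needs its address register by the end of ID, and that a store needs its data by the end of EX1. The types Instr and Outcome and the function schedule are hypothetical names.

```c
#include <assert.h>

#define NUM_REGS 16

typedef enum { OP_ALU, OP_LOAD, OP_STORE } OpKind;

/* OP_ALU:   dst = src1 op src2
 * OP_LOAD:  dst = mem[src1]          (src2 = -1)
 * OP_STORE: mem[src1] = src2         (dst  = -1) */
typedef struct { OpKind kind; int dst, src1, src2; } Instr;

typedef struct { int alu; int stalls; } Outcome; /* alu is 1 or 2 */

static int max2(int a, int b) { return a > b ? a : b; }

/* Cycle by whose end a register value exists; 0 = already in the file. */
static int ready_at(const int *ready, int reg)
{
    assert(reg < NUM_REGS);
    return reg < 0 ? 0 : ready[reg];
}

void schedule(const Instr *prog, int n, Outcome *out)
{
    int ready[NUM_REGS] = {0};
    int fetch = 0;                       /* fetch cycle of next instruction */

    for (int k = 0; k < n; k++) {
        const Instr *in = &prog[k];
        int id = fetch + 1;              /* ID cycle if there is no stall   */
        int a = ready_at(ready, in->src1);
        int b = ready_at(ready, in->src2);
        int stalls = 0, alu = 1;

        if (in->kind == OP_ALU) {
            int need = max2(a, b);
            if (need > id) {             /* operands late: fall back to ALU2 */
                alu = 2;
                if (need > id + 1)
                    stalls = need - (id + 1);
            }
            id += stalls;
            assert(in->dst >= 0 && in->dst < NUM_REGS);
            ready[in->dst] = id + alu;   /* end of EX1 or end of EX2         */
        } else {
            /* Address by end of ID; store data one cycle later (assumed). */
            int need = (in->kind == OP_STORE) ? max2(a, b - 1) : a;
            if (need > id)
                stalls = need - id;
            id += stalls;
            if (in->kind == OP_LOAD) {
                assert(in->dst >= 0 && in->dst < NUM_REGS);
                ready[in->dst] = id + 2; /* loaded value at end of EX2/MEM   */
            }
        }
        out[k].alu = alu;
        out[k].stalls = stalls;
        fetch = id;                      /* next instruction follows in order */
    }
}
```

Each instruction sequence in the questions that follow can be encoded as an array of Instr and passed to schedule; out[k].alu then reports the ALU chosen for instruction k, and a non-zero out[k].stalls marks a stall under these assumptions.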

Q2.1 For the following instruction sequence (or program), indicate which ALU each instruction will
use. Select either ALU1 or ALU2. [1 mark each]

Inst #   Instruction          ALU1 or ALU2?
1        add r1, r2, r3       ALU1
2        ldr r4, [r1, #5]     ALU1
3        add r5, r4, r6       ALU2
4        add r7, r5, r8       ALU2
5        add r1, r2, r3       ALU1
6        ldr r4, [r1, #4]     ALU1
7        add r5, r1, r6       ALU1

Q2.2.1 Indicate whether each of the following instruction sequences causes a stall in the given
pipeline. Consider each sequence independently and assume that the
pipeline is initially idle. Your answer should be either yes or no.
[2 marks each, -1 mark for a wrong answer]

Instruction sequences

1)
add r1, r2, r3
ldr r4, [r1, #5]
Ans: NO

2)
ldr r1, [r2]
add r3, r1, r4
ldr r5, [r1]
Ans: NO

3)
ldr r1, [r2]
ldr r3, [r1]
Ans: YES

4)
ldr r1, [r2]
str r1, [r3]
Ans: NO

5)
ldr r1, [r2]
add r3, r1, r4
str r5, [r3]
Ans: YES

6)
ldr r1, [r2]
add r3, r1, r4
Ans: NO

Q3. Consider the following piece of code: [Total marks 4, -2 for a wrong answer]

for (i = 0; i < 1000000; i++) {
    a = random(100);
    if (a >= 50) {
        …..
    }
}

Assume that random(N) returns a random number uniformly distributed between 0 and
N-1. For the given branch statement, what kind of branch predictor would be the
cheapest and give the best results? Write down only the name of the predictor (as
discussed in class). Your options would be Forward branch taken &
backward branch not taken
Q4. This is a short-answer question (will be manually graded).
Consider a five-stage pipeline (F, D, Ex, M, WB) with full data forwarding. Assume that each stage
takes 10 ns to finish. Ignore the setup and hold times of the registers.
Assume that you are executing a program in which a fraction “f” of all instructions immediately
follow a load upon which they are dependent. Based on this information, answer the following questions.
a. With forwarding enabled, and ignoring the cycles required to initially fill the pipeline, what
would be the total execution time for N instructions in terms of “f”? [4 marks]
b. Now consider a scenario in which, for some reason, the M (memory) stage requires 12 ns. This
leaves us with two options:
Option 1: To run the pipeline as fast as in the original problem statement, divide the
memory stage into two parts, M1 and M2. This increases the number of pipeline stages to 6.
Option 2: Slow down the whole pipeline so that each stage is now
allotted 12 ns. In this option, the number of stages remains unchanged at 5.
For a program mix as described in the main problem statement, when is the first option better than
the second option? Your answer should be in terms of “f”. [8 marks]
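
The comparison in part b can be explored numerically with a simple cost model. The sketch below is an assumption-laden illustration rather than the intended solution: it assumes a load-use dependence costs one stall cycle in the 5-stage pipeline with forwarding and two stall cycles once the memory stage is split into M1 and M2, and it ignores pipeline fill as the question instructs. The function names are hypothetical.

```c
#include <assert.h>

/* Execution time in ns for n instructions, ignoring pipeline fill.
 * f is the fraction of instructions that immediately follow a load
 * they depend on; each such instruction adds load_use_stalls cycles. */
double exec_time_ns(double n, double f, double cycle_ns, int load_use_stalls)
{
    assert(f >= 0.0 && f <= 1.0);
    return n * (1.0 + f * load_use_stalls) * cycle_ns;
}

/* Returns 1 if Option 1 (6 stages, 10 ns cycle, assumed 2-cycle load-use
 * penalty) is faster than Option 2 (5 stages, 12 ns cycle, assumed
 * 1-cycle penalty) for the given f. n cancels out, so 1.0 is used. */
int option1_is_faster(double f)
{
    double option1 = exec_time_ns(1.0, f, 10.0, 2);
    double option2 = exec_time_ns(1.0, f, 12.0, 1);
    return option1 < option2;
}
```

Evaluating option1_is_faster over a range of f values shows where the two options cross over under these assumed penalties; a different stall model would move that point.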

Q5. Short-answer question (will be manually graded)

For a 2-bit pattern history associated with a 1-bit branch predictor, identify the shortest
pattern of taken and not-taken branches that, when repeated forever, will result in a
0% prediction hit rate (again, assuming no aliasing of any kind). Your answer should
be in the form of a pattern of T and N. Explain your answer. [5 marks]
