CS152 Quiz #2: Name: - This Is A Closed Book, Closed Notes Exam. 80 Minutes 9 Pages
Name:_____ ____
This problem evaluates the cache performance of the following C code, which transposes
a square matrix A, placing the result in another matrix B.
#define N 1024
int A[N*N], B[N*N];
int i,j;
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        B[j*N+i] = A[i*N+j];
Assume A and B are both aligned to a 4KB boundary and are contiguous in memory.
ints are 32 bits (4 bytes).
This problem is simplified by LRU replacement: accesses to A and B will not conflict
with each other in the cache. Thus, the store and load miss rates can be calculated
independently of each other.
Loads are unit-stride. Every eighth load will cause a compulsory miss, and the
intervening seven loads will hit. (Spatial locality is fully exploited.) The load miss rate
is thus 1/8.
Stores have a 4KB stride. In the first outer-loop iteration, all stores are compulsory misses.
After that, every store incurs a capacity miss, because each line is evicted before its
spatial locality can be exploited. The store miss rate is thus 100%.
In the vicinity of the matrix’s diagonal, B’s accesses will map to the same set as A’s
accesses. While LRU replacement kept A’s accesses in the cache, FIFO will cause a
cache miss on every second load in this situation. The result is three load conflict misses
every outer loop iteration, resulting in a slightly higher load miss rate of (128+3)/1024.
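The LRU and FIFO load miss rates above can be checked with a small simulation. The sketch below is only an illustration: it assumes an 8KB, 2-way set-associative, write-allocate cache with 32-byte lines and the loop order B[j*N+i] = A[i*N+j]. Those cache parameters are not stated in this text; they were chosen because they are consistent with the quoted rates of 1/8 and (128+3)/1024. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N          1024                   /* matrix dimension, from the problem */
#define LINE_BYTES 32u                    /* eight 4-byte ints per line */
#define NUM_SETS   128u                   /* ASSUMPTION: 8KB / (2 ways * 32 B) */
#define NUM_WAYS   2                      /* ASSUMPTION: 2-way set-associative */
#define A_BASE     0u                     /* ASSUMPTION: any 4KB-aligned base */
#define B_BASE     (A_BASE + 4u * N * N)  /* B follows A contiguously */

enum policy { POLICY_LRU, POLICY_FIFO };

struct miss_counts { long load_misses, store_misses; };

static uint32_t line_in_way[NUM_SETS][NUM_WAYS];
static long     stamp[NUM_SETS][NUM_WAYS];  /* last use (LRU) or fill time (FIFO) */
static int      valid[NUM_SETS][NUM_WAYS];
static long     now;

/* Returns 1 on a miss (and allocates the line), 0 on a hit. */
static int cache_access(uint32_t addr, enum policy p)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t set  = line % NUM_SETS;
    int victim = 0;

    now++;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (valid[set][w] && line_in_way[set][w] == line) {
            if (p == POLICY_LRU)
                stamp[set][w] = now;  /* LRU refreshes on a hit; FIFO does not */
            return 0;
        }
    }
    /* Miss: fill an invalid way if there is one, else evict the oldest stamp. */
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!valid[set][w]) { victim = w; break; }
        if (stamp[set][w] < stamp[set][victim]) victim = w;
    }
    valid[set][victim] = 1;
    line_in_way[set][victim] = line;
    stamp[set][victim] = now;
    return 1;
}

/* Replays the transpose's loads and stores through an initially empty cache. */
static struct miss_counts simulate_transpose(enum policy p)
{
    struct miss_counts m = {0, 0};

    assert(A_BASE % 4096u == 0);
    memset(valid, 0, sizeof valid);
    now = 0;
    for (uint32_t i = 0; i < N; i++) {
        for (uint32_t j = 0; j < N; j++) {
            m.load_misses  += cache_access(A_BASE + 4u * (i * N + j), p);
            m.store_misses += cache_access(B_BASE + 4u * (j * N + i), p);
        }
    }
    return m;
}
```

Under these assumptions, simulate_transpose(POLICY_LRU) counts 131072 load misses (1/8 of the 1024*1024 loads), and simulate_transpose(POLICY_FIFO) counts 134144, which is 128+3 per outer iteration. Both policies miss on every store.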
1) Write-Allocate, Write-Back
2) Write-Allocate, Write-Through (with a write buffer)
3) No Write-Allocate, Write-Back
4) No Write-Allocate, Write-Through (with a write buffer)
No Write-Allocate, Write-Through is the best configuration. Recall that all stores miss
and there are no loads from the lines that are stored-to. Then, consider the amount of
traffic between the L1 and L2 caches in each scenario. For every store, WA caches will
cause 32 bytes of read traffic that is simply discarded. WA+WB caches will cause 32
bytes of write traffic on every store—a total of 64 bytes of L1<->L2 traffic for a 4 byte
store. WA+WT caches reduce this to 36 bytes, but the allocation traffic is still wasteful.
NWA+WB caches behave similarly to NWA+WT: stores never hit, so writebacks never
occur; thus, only 4 bytes are transferred to the L2 for every store in this code. But the
lack of a write buffer makes the miss penalty much greater for NWA+WB than for
NWA+WT, so the latter option is best.
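The traffic figures in this answer can be tabulated with a few lines of C. This is a sketch of the arithmetic only, under the answer's own premises (every store misses, stored-to lines are never loaded, 32-byte lines, 4-byte stores); the function name and its parameters are hypothetical.

```c
#include <assert.h>

#define LINE_BYTES  32  /* cache line size used in the answer */
#define STORE_BYTES 4   /* one int per store */

/* Bytes moved between L1 and L2 per store, for the transpose's store stream
 * in which every store misses and no stored-to line is ever read. */
static int l1_l2_bytes_per_store(int write_allocate, int write_back)
{
    int bytes = 0;

    assert(LINE_BYTES >= STORE_BYTES);
    if (write_allocate) {
        bytes += LINE_BYTES;                 /* line fill on the store miss */
        bytes += write_back ? LINE_BYTES     /* dirty line written back on eviction */
                            : STORE_BYTES;   /* word written through to L2 */
    } else {
        bytes += STORE_BYTES;                /* store forwarded to L2, no fill */
    }
    return bytes;
}
```

The four configurations give 64 (WA+WB), 36 (WA+WT), 4 (NWA+WB), and 4 (NWA+WT) bytes per store. The tie between the last two is broken by the write buffer, as the answer explains.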
#define N 1024
#define BLOCK 8  /* block edge in ints; a macro named B would collide with array B */
int A[N*N], B[N*N];
int i,j,k;
Credit was given to any strategy that eliminated non-compulsory misses without
changing the semantics of the code. Source code was not required; a description of the
blocking strategy was sufficient.
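One possible blocking strategy is sketched below as a tiled transpose. It is not the graded solution: the function name, its parameters, and the tile traversal order are assumptions made for illustration. Whether a given tile size removes every non-compulsory miss depends on the cache's size, associativity, and set mapping, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Transposes the n x n row-major matrix src into dst, one blk x blk tile at a
 * time, so that each tile's source and destination lines are reused while
 * they are still likely to be cached. Requires n to be a multiple of blk. */
static void transpose_blocked(const int *src, int *dst, size_t n, size_t blk)
{
    assert(blk > 0 && n % blk == 0);
    for (size_t ii = 0; ii < n; ii += blk)          /* tile row */
        for (size_t jj = 0; jj < n; jj += blk)      /* tile column */
            for (size_t i = ii; i < ii + blk; i++)
                for (size_t j = jj; j < jj + blk; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

For the arrays in this problem the call would be transpose_blocked(A, B, 1024, 8), where a tile edge of 8 ints matches one 32-byte line.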
Each matrix is 4MB, so 2 × 4MB / 4KB = 2048 compulsory misses occur. Let A_k refer to
matrix A, row k (which occupies exactly one TLB entry). After some outer iteration i, 0
≤ i < N-1, TLB[k] contains B_k for k≠i, and TLB[i] contains A_i. So, during outer iteration
i+1, the store to B_i will miss, then the store to B_(i+1) will miss (since TLB[i+1] now
contains A_(i+1)). Then, the next load to A_(i+1) will miss. So every iteration, 3 conflict misses
occur (3072 total), for a total of 5120 including compulsory misses. (Actually, only one
conflict occurs in the zeroth iteration and two in the last iteration, so the total is 5117, but this
is a detail.)
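The TLB miss count can be checked by replaying the access stream. The sketch below assumes a direct-mapped TLB with 1024 entries and 4KB pages, which is inferred from the TLB[k] notation above and not stated outright in this text. It also assumes A starts at address 0 with B directly after it and the loop order B[j*N+i] = A[i*N+j]. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N           1024
#define PAGE_BYTES  4096u
#define TLB_ENTRIES 1024u     /* ASSUMPTION: direct-mapped, indexed by page mod 1024 */
#define NUM_PAGES   (2u * N)  /* A and B each span N pages (one row per page) */

struct tlb_stats { long total_misses, compulsory_misses; };

static struct tlb_stats simulate_tlb(void)
{
    static uint32_t      entry_page[TLB_ENTRIES];
    static unsigned char entry_valid[TLB_ENTRIES];
    static unsigned char page_seen[NUM_PAGES];
    struct tlb_stats s = {0, 0};

    memset(entry_valid, 0, sizeof entry_valid);
    memset(page_seen, 0, sizeof page_seen);

    for (uint32_t i = 0; i < N; i++) {
        for (uint32_t j = 0; j < N; j++) {
            /* ASSUMPTION: A at address 0, B immediately after A. */
            uint32_t addr[2] = {
                4u * (i * N + j),               /* load  A[i*N+j] */
                4u * N * N + 4u * (j * N + i)   /* store B[j*N+i] */
            };
            for (int k = 0; k < 2; k++) {
                uint32_t page = addr[k] / PAGE_BYTES;
                uint32_t slot = page % TLB_ENTRIES;

                assert(page < NUM_PAGES);
                if (entry_valid[slot] && entry_page[slot] == page)
                    continue;                   /* TLB hit */
                s.total_misses++;
                if (!page_seen[page]) {         /* first touch of this page */
                    page_seen[page] = 1;
                    s.compulsory_misses++;
                }
                entry_valid[slot] = 1;
                entry_page[slot] = page;
            }
        }
    }
    return s;
}
```

Under these assumptions simulate_tlb() reports 5117 total misses, of which 2048 are compulsory, matching the exact count given in the parenthetical above.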
[28 points]
In this problem, we explore microtagging, a technique to reduce the access time of set-
associative caches. Recall that for associative caches, the tag check must be completed
before load results are returned to the CPU, because the result of the tag check determines
which cache way is selected. Consequently, the tag check is often on the critical path.
The time to perform the tag check (and, thus, way selection) is determined in large part
by the size of the tag. We can speed up way selection by checking only a subset of the
tag—called a microtag—and using the results of this comparison to select the appropriate
cache way. Of course, the full tag check must also occur to determine if the cache access
is a hit or a miss, but this comparison proceeds in parallel with way selection. We store
the full tags separately from the microtag array.
We will consider the impact of microtagging on a 4-way set-associative 16KB data cache
with 32-byte lines. Addresses are 32 bits long. Microtags are 8 bits long. The baseline
cache (i.e. without microtagging) is depicted in Figure H2-B in the attached handout.
Figure 1, below, shows the modified tag comparison and driver hardware in the
microtagged cache.
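The split between way selection and hit detection can be sketched in C. With a 16KB, 4-way cache and 32-byte lines there are 128 sets, so a 32-bit address has 5 offset bits, 7 index bits, and a 20-bit tag. Which 8 tag bits form the microtag is not stated here; the sketch assumes the low 8 bits. It also assumes that the valid microtags within a set are distinct, so at most one way is selected. All names are hypothetical, and the loops stand in for comparators that operate in parallel in hardware.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS   5   /* 32-byte lines */
#define INDEX_BITS    7   /* 16KB / (4 ways * 32 B) = 128 sets */
#define MICROTAG_BITS 8
#define NUM_WAYS      4

struct cache_set {
    uint32_t tag[NUM_WAYS];    /* full 20-bit tags */
    int      valid[NUM_WAYS];
};

static uint32_t addr_index(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
}

static uint32_t addr_tag(uint32_t addr)
{
    return addr >> (OFFSET_BITS + INDEX_BITS);
}

/* ASSUMPTION: the microtag is the low 8 bits of the full tag. */
static uint32_t microtag_of(uint32_t tag)
{
    return tag & ((1u << MICROTAG_BITS) - 1u);
}

/* Way selection: compares only microtags. Returns the selected way, or -1. */
static int select_way(const struct cache_set *set, uint32_t addr)
{
    uint32_t mt = microtag_of(addr_tag(addr));

    assert(set != 0);
    for (int w = 0; w < NUM_WAYS; w++)
        if (set->valid[w] && microtag_of(set->tag[w]) == mt)
            return w;
    return -1;
}

/* Hit detection: compares full tags, independently of way selection. */
static int is_hit(const struct cache_set *set, uint32_t addr)
{
    assert(set != 0);
    for (int w = 0; w < NUM_WAYS; w++)
        if (set->valid[w] && set->tag[w] == addr_tag(addr))
            return 1;
    return 0;
}
```

An address whose microtag matches a way but whose full tag does not is still steered to that way by select_way, while is_hit reports a miss, so the selected data must be discarded in that case.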
Assume that the 2-input AND gates have a 50ps delay and the 4-input OR gate has a
100ps delay.
State which of the 3C’s of cache misses this constraint affects. How will the cache miss
rate compare to an ordinary 4-way set-associative cache? How will it compare to that of
a direct-mapped cache of the same size?
Without further modifications, it is possible for aliasing to occur in this cache. Carefully
explain why this is the case.
Changing from a physical cache to a virtual cache
Adding a write buffer to a write-through cache
Adding hardware prefetching
END OF QUIZ
CS152 Spring 2010
Handout #2
CS152
Computer Architecture and Engineering
Cache Implementations
Last Updated: 3/11/2010 11:28 PM
http://inst.eecs.berkeley.edu/~cs152/sp10
Direct-mapped Cache
The following diagram shows how a direct-mapped cache is organized. To read a word from the
cache, the input address is set by the processor. Then the index portion of the address is decoded
to access the proper row in the tag memory array and in the data memory array. The selected tag
is compared to the tag portion of the input address to determine if the access is a hit or not. At the
same time, the corresponding cache block is read and the proper line is selected through a MUX.
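[Figure: direct-mapped cache. The input address is split into Tag and Index fields. The index feeds a tag decoder and a data decoder. Each tag-array row holds a tag and status bits; each data-array row holds 2^(b-2) data words. A comparator and the valid bit drive the Valid output; a MUX and an output driver deliver the Data output.]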
In the tag and data array, each row corresponds to a line in the cache. For example, a row in the
tag memory array contains one tag and two status bits (valid and dirty) for the cache line. For
direct-mapped caches, a row in the data array holds one cache line.
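The read path described above can be sketched in C. The handout leaves the line size as 2^b bytes and does not give a row count, so the values below (b = 5, 128 rows) are assumptions chosen only to make the example concrete. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define B_BITS         5                     /* ASSUMPTION: 2^b = 32-byte lines */
#define INDEX_BITS     7                     /* ASSUMPTION: 128 rows */
#define WORDS_PER_LINE (1u << (B_BITS - 2))  /* 2^(b-2) data words per row */
#define NUM_ROWS       (1u << INDEX_BITS)

struct dm_cache {
    uint32_t tag[NUM_ROWS];                  /* tag memory array */
    uint8_t  valid[NUM_ROWS];                /* status bit: valid */
    uint8_t  dirty[NUM_ROWS];                /* status bit: dirty */
    uint32_t data[NUM_ROWS][WORDS_PER_LINE]; /* data memory array */
};

/* Reads one word. Returns 1 and writes *word on a hit; returns 0 on a miss. */
static int dm_read(const struct dm_cache *c, uint32_t addr, uint32_t *word)
{
    uint32_t word_sel = (addr >> 2) & (WORDS_PER_LINE - 1u);  /* MUX select */
    uint32_t row      = (addr >> B_BITS) & (NUM_ROWS - 1u);   /* decoded index */
    uint32_t tag      = addr >> (B_BITS + INDEX_BITS);

    assert(c != 0 && word != 0);
    if (!c->valid[row] || c->tag[row] != tag)  /* comparator gated by valid bit */
        return 0;
    *word = c->data[row][word_sel];            /* output driver enabled on a hit */
    return 1;
}
```

In hardware the tag comparison and the data-array read happen at the same time; the sequential code here only models the result.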
The implementation of a 4-way set-associative cache is shown in the following diagram. (An n-
way set-associative cache can be implemented in a similar manner.) The index part of the input
address is used to find the proper row in the data memory array and the tag memory array. In this
case, however, each row (set) corresponds to four cache lines (four ways). A row in the data
memory holds four cache lines (for 32-byte cache lines, 128 bytes), and a row in the tag
memory array contains four tags and status bits for those tags (2 bits per cache line). The tag
memory and the data memory are accessed in parallel, but the output data driver is enabled only
if there is a cache hit.
[Figure: 4-way set-associative cache. The input address is split into Tag and Index fields. The index feeds a tag decoder and a data decoder. Each tag-array row holds four tag (T) and status (S) pairs; each data-array row holds 4 × 2^(b-2) data words. Four comparators, gated by the valid bits, drive the Valid output and the buffer drivers; four MUXes and an output driver deliver the data.]