Chapter 5
Multiprocessors and Thread-Level Parallelism
There has been renewed interest in developing multiprocessors since the early 2000s, for several reasons:
- The slowdown in uniprocessor performance due to the diminishing returns in exploiting instruction-level parallelism.
- The difficulty of dissipating the heat generated by uniprocessors with high clock rates.
- The demand for high-performance servers, where thread-level parallelism is natural.
For all these reasons, multiprocessor architectures have become increasingly attractive.
Flynn's taxonomy classifies architectures by their instruction and data streams:
1. Single instruction stream, single data stream (SISD): the uniprocessor, with a single processing unit (PU).
2. Single instruction stream, multiple data streams (SIMD): the same instruction is executed by multiple processing units on different data streams.
3. Multiple instruction streams, single data stream (MISD): no commercial multiprocessor of this type has been built to date, but some may appear in the future. Some special-purpose stream processors approximate a limited form of this (there is only a single data stream that is operated on by successive functional units).
4. Multiple instruction streams, multiple data streams (MIMD): each processor fetches its own instructions and operates on its own data. MIMD has emerged as the architecture of choice for general-purpose multiprocessors, for two reasons:
1. MIMDs offer flexibility. With the correct hardware and software support, MIMDs can function as single-user multiprocessors focusing on high performance for one application, as multiprogrammed multiprocessors running many tasks simultaneously, or as some combination of these functions.
2. MIMDs can build on the cost/performance advantages of off-the-shelf microprocessors. In fact, nearly all multiprocessors built today use the same microprocessors found in workstations and single-processor servers.
With an MIMD, each processor is executing its own instruction stream. In many cases, each processor executes a different process. Recall from the last chapter that a process is a segment of code that may be run independently, and that the state of the process contains all the information necessary to execute that program on a processor. In a multiprogrammed environment, where the processors may be running independent tasks, each process is typically independent of the processes on other processors.
It is also useful to be able to have multiple processors executing a single program and
sharing the code and most of their address space. When multiple processes share code
and data in this way, they are often called threads.
Today, the term thread is often used in a casual way to refer to multiple loci of
execution that may run on different processors, even when they do not share an address
space. To take advantage of an MIMD multiprocessor with n processors, we must
usually have at least n threads or processes to execute. The independent threads are
typically identified by the programmer or created by the compiler. Since the parallelism
in this situation is contained in the threads, it is called thread-level parallelism.
Threads may vary from large-scale, independent processes (for example, independent programs running in a multiprogrammed fashion on different processors) to parallel iterations of a loop, automatically generated by a compiler, each executing for perhaps less than a thousand instructions. Although the size of a thread is important in considering how to exploit thread-level parallelism efficiently, the important qualitative distinction is that such parallelism is identified at a high level by the software system and that the threads consist of hundreds to millions of instructions that may be executed in parallel. In contrast, instruction-level parallelism is identified primarily by the hardware, though with software help in some cases, and is found and exploited one instruction at a time.
Existing MIMD multiprocessors fall into two classes, depending on the number of processors involved, which in turn dictates a memory organization and interconnect strategy. We refer to the multiprocessors by their memory organization because what constitutes a small or large number of processors is likely to change over time.
The first group, which we call centralized shared-memory architectures, has at most a few dozen processors in 2000.
For multiprocessors with small processor counts, it is possible for the processors to share
a single centralized memory and to interconnect the processors and memory by a bus.
With large caches, the bus and the single memory, possibly with multiple banks, can
satisfy the memory demands of a small number of processors. By replacing a single bus
with multiple buses, or even a switch, a centralized shared memory design can be scaled
to a few dozen processors. Although scaling beyond that is technically conceivable,
sharing a centralized memory, even organized as multiple banks, becomes less attractive
as the number of processors sharing it increases.
Because there is a single main memory that has a symmetric relationship to all processors
and a uniform access time from any processor, these multiprocessors are often called
symmetric (shared-memory) multiprocessors (SMPs), and this style of architecture is sometimes called UMA for uniform memory access. This type of centralized shared-memory architecture is currently by far the most popular organization.
The second group consists of multiprocessors with physically distributed memory. To
support larger processor counts, memory must be distributed among the processors rather
than centralized; otherwise the memory system would not be able to support the
bandwidth demands of a larger number of processors without incurring excessively long
access latency. With the rapid increase in processor performance and the associated
increase in a processor's memory bandwidth requirements, the scale of multiprocessor for which distributed memory is preferred continues to shrink.

Distributed-memory multiprocessor
Distributing the memory among the nodes has two major benefits. First, it is a cost-effective way to scale the memory bandwidth, if most of the accesses are to the local memory in the node. Second, it reduces the latency for accesses to the local memory.
These two advantages make distributed memory attractive at smaller processor counts as
processors get ever faster and require more memory bandwidth and lower memory
latency. The key disadvantage for a distributed memory architecture is that
communicating data between processors becomes somewhat more complex and has
higher latency, at least when there is no contention, because the processors no longer
share a single centralized memory. As we will see shortly, the use of distributed memory
leads to two different paradigms for interprocessor communication.
Typically, I/O as well as memory is distributed among the nodes of the multiprocessor, and the nodes may be small SMPs (2-8 processors). The use of multiple processors in a node together with a memory and a network interface is quite useful from the cost-efficiency viewpoint.
Example: Suppose you want to achieve a speedup of 80 with 100 processors. What fraction of the original computation can be sequential?

Amdahl's law gives

    80 = 1 / (Fraction_parallel / 100 + (1 - Fraction_parallel))

Solving yields Fraction_parallel = 0.9975, so at most 0.25% of the original computation can be sequential.
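To make the arithmetic concrete, here is a minimal C sketch (the function names are illustrative, not from the text) that evaluates Amdahl's law and reproduces the 0.9975 answer:

#include <stdio.h>

/* Amdahl's law: speedup when a fraction f of the work runs on n processors. */
double speedup(double f, int n) {
    return 1.0 / (f / n + (1.0 - f));
}

/* Closed-form solution for the parallel fraction needed to reach
   a target speedup s on n processors. */
double fraction_needed(double s, int n) {
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n);
}

int main(void) {
    double f = fraction_needed(80.0, 100);
    printf("parallel fraction needed = %.4f\n", f);      /* 0.9975 */
    printf("check: speedup = %.1f\n", speedup(f, 100));  /* 80.0 */
    return 0;
}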
There are two models for communication and memory architecture:
- Shared memory: communication occurs through a shared address space, accessed implicitly via load and store operations (see the sketch after this list).
- Message-passing multiprocessor: the address space consists of multiple private address spaces that are logically disjoint and cannot be addressed by a remote processor; communication occurs by explicitly passing messages.
- Multicomputer (cluster): individual computers, each with its own private address space, connected by a network and communicating by message passing.
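As a small illustration of the shared-memory model, consider this hedged pthreads sketch (not from the text): the threads communicate implicitly, simply by loading and storing the same variable, with a mutex providing synchronization.

#include <pthread.h>
#include <stdio.h>

long counter = 0;   /* shared data: one address space, visible to all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;              /* communication is an ordinary load/store */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* 200000 */
    return 0;
}

In a message-passing model, the same exchange would require explicit send and receive operations instead of ordinary loads and stores.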
Cache Coherence
Caches hold both private data (used by a single processor) and shared data (used by multiple processors, which communicate through reads and writes of the shared data).
Unfortunately, caching shared data introduces a new problem: the view of memory held by two different processors is through their individual caches, which, without any additional precautions, could end up holding two different values. That is, if two different processors have two different values for the same location, we have what is generally referred to as the cache coherence problem.
Informally, a memory system is coherent if any read of a data item returns the most recently written value of that data item. Better, a memory system is coherent if:
1. A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P.
2. If P writes x and then P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart and no other writes to x occur between the two accesses.
3. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors.
These three properties are sufficient to ensure coherence. When a written value will be seen is also important: this issue is defined by the memory consistency model. Coherence and consistency are complementary: coherence defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to accesses to other memory locations.
Coherent caches provide both migration and replication of shared data items:
- Migration: a data item can be moved to a local cache and used there in a transparent fashion, reducing the latency of access.
- Replication: when shared data are being read simultaneously, the caches make a copy of the data item in each local cache, reducing both access latency and contention for the shared item.
There are two classes of cache coherence protocols:
- Directory based: the sharing status of a block of physical memory is kept in just one location, called the directory.
- Snooping: every cache that has a copy of the data also has a copy of the sharing status of the block. The caches are all accessible via some broadcast medium (a bus or switch), and all cache controllers monitor, or snoop, on the medium.
Snooping protocols became popular with multiprocessors whose caches are attached to a single shared memory because they can use the existing physical connection (the bus to memory) to interrogate the status of the caches. A snoop-based cache coherence scheme can be implemented on a shared bus, or on any communication medium that broadcasts cache misses to all the processors.
Write strategies
- Write invalidate: on a write to shared data, an invalidate is sent to all caches, which snoop and invalidate any copies. A subsequent read by another processor misses in its cache and fetches a new copy of the data.
- Write update (write broadcast): on a write to shared data, the new value is broadcast on the bus; processors snoop and update any copies.
Example Protocol
A snooping coherence protocol is usually implemented by incorporating a finite-state controller in each node. Logically, one can think of a separate controller being associated with each block; that is, snooping operations or cache requests for different blocks can proceed independently. In actual implementations, a single controller allows multiple operations to distinct blocks to proceed in interleaved fashion; that is, one operation may be initiated before another is completed, even though only one cache access or one bus access is allowed at a time. A minimal sketch of such a controller follows.
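The sketch below, in C, assumes a simple three-state (MSI-style) write-invalidate protocol; the state and event names are illustrative, not taken from a specific machine:

typedef enum { INVALID, SHARED, MODIFIED } BlockState;

typedef enum {
    CPU_READ, CPU_WRITE,          /* requests from this node's processor */
    BUS_READ_MISS, BUS_WRITE_MISS /* requests snooped from the bus */
} Event;

/* One transition of the snooping controller for a single cache block. */
BlockState next_state(BlockState s, Event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED;    /* read miss: fetch the block */
        if (e == CPU_WRITE) return MODIFIED;  /* write miss: fetch, invalidate others */
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE)      return MODIFIED; /* place invalidate on the bus */
        if (e == BUS_WRITE_MISS) return INVALID;  /* another cache wants to write */
        return SHARED;
    case MODIFIED:
        if (e == BUS_READ_MISS)  return SHARED;   /* write back, then share */
        if (e == BUS_WRITE_MISS) return INVALID;  /* write back, then invalidate */
        return MODIFIED;
    }
    return s;
}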
Conclusion
For smaller multiprocessors, snooping caches over a shared medium maintain coherence by invalidating other cached copies on each write.
Implementation Complications
Write races: a cache cannot update a block until it obtains the bus, otherwise another processor may get the bus first and write the same cache block. Writing therefore becomes a two-step process: arbitrate for the bus, then place the miss on the bus and complete the operation.
Performance Measurement
Coherence misses can be broken into two separate sources: true sharing misses and false sharing misses. Changing the processor count, cache size, and block size can affect these two components of the miss rate.
A false sharing miss occurs when the block is shared but no word in the cache block is actually shared; such a miss would not occur if the block size were a single word.

Example: Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss or a false sharing miss.
Time    P1          P2
1       Write x1
2                   Read x2
3       Write x1
4                   Write x2
5       Read x2
Example Result
1. True sharing miss, since x1 was read by P2 and needs to be invalidated from P2.
2. False sharing miss, since x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2.
3. False sharing miss, since the block containing x1 is marked shared due to the read in P2, but P2 did not read x1. A write miss is required to obtain exclusive access to the block.
4. False sharing miss, for the same reason as step 3.
5. True sharing miss, since the value being read was written by P2.
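False sharing is easy to provoke in practice. In this hedged C sketch (the line size and loop counts are assumptions, not from the text), two threads update different words that happen to share a cache block; uncommenting the padding puts each word in its own block and removes the coherence misses:

#include <pthread.h>

#define CACHE_LINE 64   /* assumed cache block size in bytes */

struct {
    long x1;
    /* char pad[CACHE_LINE - sizeof(long)];  uncomment to eliminate false sharing */
    long x2;            /* shares a cache block with x1: falsely shared */
} shared;

void *bump_x1(void *arg) {
    for (long i = 0; i < 10000000; i++) shared.x1++;  /* invalidates the other copy */
    return NULL;
}

void *bump_x2(void *arg) {
    for (long i = 0; i < 10000000; i++) shared.x2++;  /* invalidates the other copy */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_x1, NULL);
    pthread_create(&t2, NULL, bump_x2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}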
Coherence Problems
A simple way to avoid coherence problems is to mark shared data as uncacheable, keeping only private data in caches; this, however, gives up the performance benefits of caching shared data. The alternative is a directory: a directory per memory tracks the state of every block in every cache, recording which caches have copies of the memory block, whether it is dirty or clean, and so on.
Directory Protocols
The directory records one of three states for each block:
- Shared: one or more processors have the block cached, and the value in memory is up to date (as well as in all the caches).
- Uncached: no processor has a copy of the cache block (it is not valid in any cache).
- Exclusive: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date.
In addition to tracking the state of each cache block, we must track the processors that have copies of the block when it is shared (usually a bit vector per memory block, with bit i set if processor i has a copy).
Keep it simple(r):
- Local node: the node where a request originates.
- Home node: the node where the memory location and directory entry of an address reside.
- Remote node: the node that has a copy of a cache block (exclusive or shared).
Compared with snooping protocols:
- The states are identical, and the stimulus is almost identical.
- A write to a shared cache block is treated as a write miss (without fetching the block).
- On a write miss, data fetch and selective invalidate operations are sent by the directory controller (rather than broadcast, as in snooping protocols).
Block in Uncached state: the copy in memory is the current value; the only possible requests for that block are:
- Read miss: the requesting processor is sent the data from memory, and the requestor is made the only sharing node; the state of the block is made Shared.
- Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached; Sharers indicates the identity of the owner.

Block in Shared state: the memory value is up to date:
- Read miss: the requesting processor is sent the data from memory, and the requesting processor is added to the sharing set.
- Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.

Block in Exclusive state: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner), so there are three possible directory requests:
- Read miss: the owner processor is sent a data fetch message, causing the state of the block in the owner's cache to transition to Shared and causing the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy); the state is made Shared.
- Data write-back: the owner processor is replacing the block and hence must write it back, making the memory copy up to date (the home directory essentially becomes the owner); the block is now Uncached, and the Sharers set is empty.
- Write miss: the block has a new owner. A message is sent to the old owner, causing the cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
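Putting the three states together, here is a hedged C sketch of the directory controller's miss handling; the message helpers are hypothetical stand-ins for the interconnect (they just log), and the logic follows the transitions described above:

#include <stdio.h>
#include <stdint.h>

typedef enum { UNCACHED, SHARED_STATE, EXCLUSIVE } DirState;

typedef struct {
    DirState state;
    uint64_t sharers;   /* bit vector: bit p is set if processor p has a copy */
} DirEntry;

/* Hypothetical interconnect helpers: stubs that log the message. */
void send_data(int p, int b)       { printf("data  -> P%d, block %d\n", p, b); }
void send_invalidate(int p, int b) { printf("inval -> P%d, block %d\n", p, b); }
void send_fetch(int p, int b)      { printf("fetch -> P%d, block %d\n", p, b); }

static int owner_of(uint64_t sharers) {   /* index of the single set bit */
    int p = 0;
    while (!(sharers & 1)) { sharers >>= 1; p++; }
    return p;
}

void read_miss(DirEntry *d, int block, int p) {
    if (d->state == EXCLUSIVE)             /* owner writes back, then shares */
        send_fetch(owner_of(d->sharers), block);
    send_data(p, block);
    d->sharers |= 1ULL << p;               /* add requestor to the sharing set */
    d->state = SHARED_STATE;
}

void write_miss(DirEntry *d, int block, int p) {
    if (d->state == SHARED_STATE) {        /* selective invalidates, not broadcast */
        for (int q = 0; q < 64; q++)
            if (((d->sharers >> q) & 1) && q != p) send_invalidate(q, block);
    } else if (d->state == EXCLUSIVE) {    /* fetch the block from the old owner */
        send_fetch(owner_of(d->sharers), block);
    }
    send_data(p, block);
    d->sharers = 1ULL << p;                /* requestor becomes the sole owner */
    d->state = EXCLUSIVE;
}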
Why Synchronize?
We need to know when it is safe for different processes to use shared data. Synchronization is built on hardware primitives that atomically read and modify a memory location, such as:
- Atomic exchange: interchanges a value in a register for a value in memory.
- Test-and-set: tests a value and sets it if the value passes the test.
- Fetch-and-increment: returns the value of a memory location and atomically increments it.
An alternative is a pair of instructions, load linked (ll) and store conditional (sc), used in the sequences below.
Implementing an atomic exchange with load linked and store conditional:

try:    mov   R3,R4        ; move exchange value
        ll    R2,0(R1)     ; load linked
        sc    R3,0(R1)     ; store conditional
        beqz  R3,try       ; branch if store conditional fails
        mov   R4,R2        ; put loaded value in R4

Implementing fetch-and-increment:

try:    ll    R2,0(R1)     ; load linked
        addi  R2,R2,#1     ; increment
        sc    R2,0(R1)     ; store conditional
        beqz  R2,try       ; branch if store conditional fails
Spin locks: the processor continuously tries to acquire the lock, spinning around a loop until it gets it:

lockit: li    R2,#1
        exch  R2,0(R1)     ;atomic exchange
        bnez  R2,lockit    ;already locked?
Problem: exchange includes a write, which invalidates all other copies; this
generates considerable bus traffic
Solution: start by simply repeatedly reading the variable; when it changes, then
try exchange (test and test&set):
try:    li    R2,#1
lockit: lw    R3,0(R1)     ;load var
        bnez  R3,lockit
        exch  R2,0(R1)     ;atomic exchange
        bnez  R2,try       ;already locked?
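For comparison, the same test-and-test-and-set idea expressed with portable C11 atomics (a sketch, not code from the text):

#include <stdatomic.h>

atomic_int lock_var = 0;   /* 0 = free, 1 = held */

void acquire(void) {
    for (;;) {
        /* spin on plain reads first: no bus traffic while the lock is held */
        while (atomic_load_explicit(&lock_var, memory_order_relaxed) != 0)
            ;
        /* the lock looks free: now attempt the atomic exchange */
        if (atomic_exchange_explicit(&lock_var, 1, memory_order_acquire) == 0)
            return;   /* exchange returned 0: we got the lock */
    }
}

void release(void) {
    atomic_store_explicit(&lock_var, 0, memory_order_release);
}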
What is consistency? When must a processor see the new value? Consider:

P1:     A = 0;             P2:     B = 0;
        .....                      .....
        A = 1;                     B = 1;
L1:     if (B == 0) ...    L2:     if (A == 0) ...

It seems that it should be impossible for both if statements (L1 and L2) to evaluate true, but if write invalidates are delayed and the processors are allowed to continue during the delay, both can be true.
Several relaxed models for memory consistency exist, since most programs are synchronized; the models are characterized by their attitude toward relaxing the RAR, WAR, RAW, and WAW orderings to different addresses.
Key idea: allow reads and writes to complete out of order, but use synchronization operations to enforce ordering, so that a synchronized program behaves as if the processor were sequentially consistent. One such model retains ordering among writes; because of this, many programs that operate under sequential consistency also operate under this model without additional synchronization. It is called processor consistency.
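In C11 terms the idea can be sketched as follows (illustrative, not from the text): ordinary data accesses are free to complete out of order, while a release store paired with an acquire load enforces just enough ordering for the synchronized program to behave as if sequentially consistent.

#include <stdatomic.h>

int payload;              /* ordinary data: accesses may be reordered */
atomic_int ready = 0;     /* synchronization variable */

void producer(void) {
    payload = 42;
    /* release: all earlier writes are visible before the flag is seen as set */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

void consumer(void) {
    /* acquire: no later reads may be moved before this check */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                 /* spin until the producer signals */
    /* payload is guaranteed to read 42 here */
}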
Chapter 7
Memory Hierarchy Design
Principle of locality
- Most programs do not access all code or data uniformly.
- Locality occurs in time (temporal locality) and in space (spatial locality).

Guidelines
- The goal is to provide a memory system with a cost per byte almost as low as the cheapest level of memory and a speed almost as fast as the fastest level.
- Each level maps addresses from a slower, larger memory to a smaller but faster memory higher in the hierarchy.
- This mapping requires both address mapping and address checking at each level.
Memory Hierarchy

Level           Capacity     Access time   Unit of transfer
CPU registers   500 bytes    0.25 ns       word (register)
Cache           64 KB        1 ns          cache block
Main memory     512 MB       100 ns        page
Disk (I/O)      100 GB       5 ms          file

Speed falls and capacity grows from the upper levels of the hierarchy toward the lower levels.

[Figure: the processor-memory performance gap, 1980-2010. On a log scale from 1 to 100,000, processor performance grows far faster than memory performance, so the gap widens each year.]
Cache basics
A block is fetched from memory and placed in the cache together with its address tag. A key design decision is placement: in a set-associative cache, a block is first mapped onto a set and can then be placed anywhere within that set; with n blocks in a set, the cache is n-way set associative. The two basic operations on cache data are the cache read and the cache write.
- Write through: updates the cache and writes through to update memory.
- Write back: updates only the copy in the cache; memory is updated when the block is replaced.
Both strategies can use a write buffer: this allows the cache to proceed as soon as the data are placed in the buffer rather than wait the full latency to write the data into memory.
The metric used to measure the benefits of a cache is the miss rate:

    Miss rate = number of accesses that miss / total number of accesses
Misses are traditionally classified into three categories (the three C's):
- Compulsory: the very first access to a block cannot be in the cache, so the block must be brought in.
- Capacity: the cache cannot contain all the blocks that are needed by the program.
- Conflict: if the block placement strategy is not fully associative, a block may be discarded and later retrieved because too many blocks map to the same set.
Miss rates can also be expressed per instruction:

    Misses per instruction = (Miss rate × Memory accesses) / Instruction count
Cache Optimizations
Six basic cache optimizations:
1. Larger block size to reduce miss rate.
2. Larger cache memory to reduce miss rate: the drawbacks are a larger hit time for the larger cache memory and higher cost and power.
3. Higher associativity to reduce miss rate.
4. Multilevel caches to reduce miss penalty (a worked example with assumed numbers follows this list):
   Average memory access time = Hit time L1 + Miss rate L1 × (Hit time L2 + Miss rate L2 × Miss penalty L2)
5. Giving priority to read misses over writes to reduce miss penalty.
6. Avoiding address translation during indexing of the cache to reduce hit time:
   - Caches must cope with the translation of a virtual address from the processor to a physical address to access memory.
   - A common optimization is to use the page offset (the part that is identical in both virtual and physical addresses) to index the cache.
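A worked instance of the multilevel formula with hypothetical numbers (assumed for illustration, not from the text): take a 1-cycle L1 hit time, 4% L1 miss rate, 10-cycle L2 hit time, 50% local L2 miss rate, and 100-cycle L2 miss penalty. Then

    Average memory access time = 1 + 0.04 × (10 + 0.5 × 100) = 1 + 0.04 × 60 = 3.4 cycles.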
Advanced cache optimizations:
- Way prediction
- Trace caches
- Pipelined caches
- Multibanked caches
- Nonblocking caches
- Compiler optimizations
- Hardware prefetching
- Compiler prefetching
A small cache can help hit time, since a smaller memory takes less time to index. For example, the L1 caches have stayed the same size across three generations of AMD microprocessors: K6, Athlon, and Opteron. Median ratios of access time relative to a direct-mapped cache are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way set-associative caches, respectively.
[Figure: access time relative to a direct-mapped cache for 1-way, 2-way, 4-way, and 8-way caches, for cache sizes from 16 KB to 1 MB.]
How can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?

Way prediction: keep extra bits in the cache to predict the way (the block within the set) of the next cache access. A correct prediction gives the fast direct-mapped hit time; a way misprediction costs extra time (the way-miss hit time), which is still much less than the miss penalty.
On a way miss, the other blocks in the set are checked for matches in the next clock cycle. Prediction accuracy is about 85%.
Trace caches:
+ They better utilize long blocks (execution does not exit in the middle of a block and does not enter at a label in the middle of a block).
- They complicate address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size.

Pipelined caches increase cache bandwidth; instruction-cache access takes several stages (e.g., 4 clock cycles on the Pentium 4), at the cost of more clock cycles between the issue of the load and the use of the data.
Nonblocking caches: "hit under miss" reduces the effective miss penalty by continuing to serve hits during a miss instead of ignoring CPU requests; "hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
[Figure: ratio of memory stall time as the number of outstanding misses allowed grows ("0->1", "1->2", "2->64") relative to the base, across SPEC integer benchmarks (compress, eqntott, espresso, xlisp, mdljsp2) and floating-point benchmarks (ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora).]
FP programs on average: AMAT= 0.68 -> 0.52 -> 0.34 -> 0.26
Int programs on average: AMAT= 0.24 -> 0.20 -> 0.19 -> 0.19
Rather than treat the cache as a single monolithic block, we can divide it into independent banks that can support simultaneous accesses. Banking works best when the accesses naturally spread themselves across the banks, so the mapping of addresses to banks affects the behavior of the memory system. E.g., with sequential interleaving across 4 banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on.
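In code, the sequentially interleaved mapping is just a modulo (a one-line C sketch; NBANKS is illustrative):

#define NBANKS 4

/* Sequential interleaving: block address modulo the number of banks. */
static inline int bank_of(unsigned long block_addr) {
    return (int)(block_addr % NBANKS);   /* bank 0 holds addresses = 0 mod 4, etc. */
}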
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution. Spatial locality means the CPU is likely to want the next sequential word, so the benefit of early restart alone is not clear.

Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Long blocks are more popular today, so critical word first is widely used.
Eighth optimization: merging write buffers to reduce miss penalty. If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry; if so, the new data are combined with that entry. Merging increases the block size of a write for a write-through cache when writes go to sequential words or bytes, since multiword writes are more efficient to memory. The Sun T1 (Niagara) processor, among many others, uses write merging.
McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, purely in software. Compiler optimizations can reduce misses for both instructions and data. One example is loop fusion: combine two independent loops that have the same looping structure and some overlapping variables (see the sketch below).
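A minimal C sketch of loop fusion (the arrays are illustrative): before fusion, each a[i] produced by the first loop may be evicted before the second loop re-reads it; after fusion it is reused while still in the cache.

#define N 1024
double a[N], b[N], c[N], d[N];

/* Before: two independent loops with the same bounds. */
void unfused(void) {
    for (int i = 0; i < N; i++) a[i] = b[i] * c[i];
    for (int i = 0; i < N; i++) d[i] = a[i] + c[i];   /* a[i], c[i] fetched again */
}

/* After: one loop; a[i] and c[i] are reused while still cached. */
void fused(void) {
    for (int i = 0; i < N; i++) {
        a[i] = b[i] * c[i];
        d[i] = a[i] + c[i];
    }
}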
[Figure: miss rate versus blocking factor (blocking factors of roughly 50-150; miss rates on the order of 0.05-0.1).]
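The blocking factor in the figure is the submatrix size B used by a blocked (tiled) computation. Here is a hedged C sketch of blocked matrix multiplication, the standard example of this optimization; N and B are illustrative (N is chosen divisible by B), and B would be tuned so the working set of a block fits in the cache:

#define N 512
#define B 64    /* blocking factor */
double x[N][N], y[N][N], z[N][N];

/* Blocked x = x + y*z: operate on B x B submatrices so the parts of
   y and z being touched stay resident in the cache. */
void blocked_matmul(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}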
Prefetching relies on having extra memory bandwidth that can be used without penalty.

Instruction prefetching: typically, the CPU fetches two blocks on a miss, the requested block and the next consecutive block; the requested block is placed in the instruction cache, and the prefetched block is placed in an instruction stream buffer.

Data prefetching
[Figure: performance improvement from hardware data prefetching on the Pentium 4 for SPECint2000 (gap, mcf) and SPECfp2000 benchmarks (wupwise, swim, applu, galgel, facerec, fam3d, lucas, mgrid, equake), with speedups ranging from about 1.16 to 1.97.]
The techniques to improve hit time, bandwidth, miss penalty, and miss rate generally affect the other components of the average memory access time equation, as well as the complexity of the memory hierarchy.