Ece4750 T01 Proc Scycle

ECE 4750 Computer Architecture, Fall 2014
T01 Single-Cycle Processors

School of Electrical and Computer Engineering
Cornell University
revision: 2014-09-03-17-21
1 Instruction Set Architecture
  1.1. IBM 360 Instruction Set Architecture
  1.2. MIPS32 Instruction Set Architecture
  1.3. PARC Instruction Set Architecture
2 Single-Cycle Processors
  2.1. Single-Cycle Processor Datapath
  2.2. Single-Cycle Processor Control Unit
3 Analyzing Processor Performance
1. Instruction Set Architecture
By the early 1960s, IBM had several incompatible lines of computers!
Defense : 701
Scientific : 704, 709, 7090, 7094
Business : 702, 705, 7080
Mid-Sized Business : 1400
Decimal Architectures : 7070, 7072, 7074
Each system had its own:
Implementation and potentially even technology
Instruction set
I/O system and secondary storage (tapes, drums, disks)
Assemblers, compilers, libraries, etc
Application niche
IBM 360 was the first line of machines to separate ISA from microarchitecture
Enabled same software to run on different current and future microarchitectures
Reduced impact of modifying the microarchitecture, enabling rapid innovation in hardware
[Figure: computer systems abstraction stack, from top to bottom: Application, Algorithm, Programming Language, Operating System, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gate Level, Circuits, Devices, Physics]
... the structure of a computer that a machine language programmer
must understand to write a correct (timing independent)
program for that machine.
Amdahl, Blaauw, Brooks, 1964
ISA is the contract between software and hardware
1. How is data represented?
   Representations for characters, integers, floating-point
   Integer formats can be signed or unsigned
   Floating-point formats can be single- or double-precision
   Byte addresses can be ordered within a word as either little- or big-endian
2. Where can data be stored?
   Registers: general-purpose, floating-point, control
   Memory: different address spaces for heap, stack, I/O
3. How can data be accessed?
   Register: operand stored in registers
   Immediate: operand is part of instruction
   Direct: address of operand in memory is stored in instruction
   Register Indirect: address of operand in memory is stored in register
   Displacement: register indirect, address is added to immediate
   Autoincrement/decrement: register indirect, address is automatically adjusted
   PC-Relative: displacement is added to the program counter
4. What operations can be done on data?
   Integer and floating-point arithmetic instructions
   Register and memory data movement instructions
   Control transfer instructions
   System control instructions
5. How are instructions encoded?
   Opcode, addresses of operands and destination, next instruction
   Variable length vs. fixed length
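
The addressing modes listed in item 3 differ only in how the operand (or its address) is formed. The following is a minimal Python sketch under stated assumptions: the machine state (a `regs` list, a `mem` dictionary keyed by byte address, and a `pc` value), the function name `operand`, and the mode strings are all hypothetical and chosen for illustration; they do not come from any ISA discussed in these notes.

```python
def operand(mode, regs, mem, pc, reg=0, imm=0):
    """Return the operand value for one addressing mode (illustrative only)."""
    if mode == "register":            # operand is stored in a register
        return regs[reg]
    if mode == "immediate":           # operand is part of the instruction
        return imm
    if mode == "direct":              # instruction holds the memory address
        return mem[imm]
    if mode == "register_indirect":   # register holds the memory address
        return mem[regs[reg]]
    if mode == "displacement":        # register plus immediate forms the address
        return mem[regs[reg] + imm]
    if mode == "autoincrement":       # register indirect, then adjust the register
        value = mem[regs[reg]]
        regs[reg] += 4                # assumption: 4-byte words
        return value
    if mode == "pc_relative":         # displacement is added to the program counter
        return mem[pc + imm]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical machine state used only for this example.
regs = [0, 0x100]
mem = {0x40: 33, 0x100: 11, 0x104: 22, 0x208: 44}
pc = 0x200
```

With this state, displacement addressing through `regs[1]` with an immediate of 4 reads `mem[0x104]`, while autoincrement reads `mem[0x100]` and leaves `regs[1]` pointing at the next word.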
1.1. IBM 360 Instruction Set Architecture
How is data represented?
8-bit bytes, 16-bit half-words, 32-bit words, 64-bit double-words
IBM 360 is why bytes are 8 bits long today!
Where can data be stored?
2^24 32-bit memory locations
16 general-purpose 32-bit registers and 4 floating-point 64-bit registers
Condition codes, control flags, program counter
What operations can be done on data?
Large number of arithmetic, data movement, and control instructions
                Model 30          Model 70
Storage         8-64 KB           256-512 KB
Datapath        8-bit             64-bit
Circuit Delay   30 ns/level       5 ns/level
Local Store     Main store        Transistor registers
Control Store   Read only, 1 µs   Conventional circuits
IBM 360 instruction set architecture completely hid the underlying technological differences between various models
Significant Milestone: The first true ISA designed as a portable hardware-software interface
IBM 360: 45 years later ...
The zSeries Z196 Microprocessor
5.2 GHz in IBM 45 nm SOI
1.4B transistors in 512 mm^2
Four cores per chip
Aggressive out-of-order execution
Four-level cache hierarchy with embedded DRAM L3/L4
Can still run IBM 360 code!
Hot Chips 2010
1.2. MIPS32 Instruction Set Architecture
How is data represented?
8-bit bytes, 16-bit half-words, 32-bit words
32-bit single-precision, 64-bit double-precision floating point
Where can data be stored?
2^32 32-bit memory locations
32 general-purpose 32-bit registers, 32 SP (16 DP) floating-point registers
FP status register, Program counter
How can data be accessed?
Register, register indirect, displacement
What operations can be done on data?
Large number of arithmetic, data movement, and control instructions
How are instructions encoded?
Fixed-length 32-bit instructions
MIPS R2K: 1986, single-issue, in-order, off-chip caches, 2 µm, 8-15 MHz, 110K transistors, 80 mm^2
MIPS R10K: 1996, quad-issue, out-of-order, on-chip caches, 0.35 µm, 200 MHz, 6.8M transistors, 300 mm^2
Add Immediate Unsigned Word (ADDIU)
MIPS32 Architecture For Programmers Volume II: The MIPS32 Instruction Set, Revision 2.62
Copyright 2001-2003,2005,2008-2009 MIPS Technologies Inc. All rights reserved.
Format: ADDIU rt, rs, immediate (MIPS32)
Purpose: Add Immediate Unsigned Word
  To add a constant to a 32-bit integer
Description: GPR[rt] ← GPR[rs] + immediate
  The 16-bit signed immediate is added to the 32-bit value in GPR rs and the 32-bit arithmetic result is placed into GPR rt.
  No Integer Overflow exception occurs under any circumstances.
Restrictions:
  None
Operation:
  temp ← GPR[rs] + sign_extend(immediate)
  GPR[rt] ← temp
Exceptions:
  None
Programming Notes:
  The term "unsigned" in the instruction name is a misnomer; this operation is 32-bit modulo arithmetic that does not trap on overflow. This instruction is appropriate for unsigned arithmetic, such as address arithmetic, or integer arithmetic environments that ignore overflow, such as C language arithmetic.
Encoding:
  Bits    31..26   25..21   20..16   15..0
  Field   ADDIU    rs       rt       immediate
  Value   001001
  Width   6        5        5        16
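
The ADDIU encoding and its modulo-2^32 behavior can be checked with a few lines of Python. This is a minimal sketch, not a reference model: the helper names (`sign_extend16`, `encode_addiu`, `exec_addiu`) and the `gpr` list are hypothetical, and the sketch ignores that GPR 0 is hard-wired to zero in MIPS.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def encode_addiu(rt, rs, imm):
    """Pack opcode 001001, rs, rt, and the 16-bit immediate into one word."""
    return (0b001001 << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def exec_addiu(gpr, word):
    """Apply GPR[rt] <- GPR[rs] + sign_extend(immediate), modulo 2**32."""
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = sign_extend16(word)
    gpr[rt] = (gpr[rs] + imm) & MASK32   # wraps around, never traps on overflow


gpr = [0] * 32
gpr[4] = 0xFFFFFFFF
word = encode_addiu(rt=2, rs=4, imm=1)   # addiu r2, r4, 1
exec_addiu(gpr, word)                    # gpr[2] wraps around to 0
```

Adding 1 to 0xFFFFFFFF wraps to 0 with no exception, which is the "does not trap on overflow" behavior described in the programming notes.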
Load Word (LW)
Format: LW rt, offset(base) (MIPS32)
Purpose: Load Word
  To load a word from memory as a signed value
Description: GPR[rt] ← memory[GPR[base] + offset]
  The contents of the 32-bit word at the memory location specified by the aligned effective address are fetched, sign-extended to the GPR register length if necessary, and placed in GPR rt. The 16-bit signed offset is added to the contents of GPR base to form the effective address.
Restrictions:
  The effective address must be naturally-aligned. If either of the 2 least-significant bits of the address is non-zero, an Address Error exception occurs.
Operation:
  vAddr ← sign_extend(offset) + GPR[base]
  if vAddr[1..0] ≠ 0^2 then
      SignalException(AddressError)
  endif
  (pAddr, CCA) ← AddressTranslation(vAddr, DATA, LOAD)
  memword ← LoadMemory(CCA, WORD, pAddr, vAddr, DATA)
  GPR[rt] ← memword
Exceptions:
  TLB Refill, TLB Invalid, Bus Error, Address Error, Watch
Encoding:
  Bits    31..26   25..21   20..16   15..0
  Field   LW       base     rt       offset
  Value   100011
  Width   6        5        5        16
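
The LW operation can be sketched in Python as an effective-address computation followed by an alignment check. The sketch assumes a hypothetical `memory` dictionary that maps word-aligned byte addresses to 32-bit values, and it skips address translation entirely (the virtual address is used directly); the function name `load_word` is illustrative only.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def load_word(gpr, memory, rt, offset, base):
    """Sketch of LW rt, offset(base) without address translation."""
    vaddr = (sign_extend16(offset) + gpr[base]) & MASK32
    if vaddr & 0b11:                       # either of the 2 low bits is non-zero
        raise ValueError(f"AddressError: unaligned address {vaddr:#x}")
    gpr[rt] = memory[vaddr] & MASK32       # 32-bit GPRs, so no extension needed


gpr = [0] * 32
gpr[4] = 0x1008
memory = {0x1004: 0xDEADBEEF}
load_word(gpr, memory, rt=2, offset=-4, base=4)   # lw r2, -4(r4)
```

Calling `load_word` with an offset of 2 from the same base would raise the error, mirroring the Address Error exception in the restrictions above.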
Branch on Not Equal (BNE)
Format: BNE rs, rt, offset (MIPS32)
Purpose: Branch on Not Equal
  To compare GPRs then do a PC-relative conditional branch
Description: if GPR[rs] ≠ GPR[rt] then branch
  An 18-bit signed offset (the 16-bit offset field shifted left 2 bits) is added to the address of the instruction following the branch (not the branch itself), in the branch delay slot, to form a PC-relative effective target address.
  If the contents of GPR rs and GPR rt are not equal, branch to the effective target address after the instruction in the delay slot is executed.
Restrictions:
  Processor operation is UNPREDICTABLE if a branch, jump, ERET, DERET, or WAIT instruction is placed in the delay slot of a branch or jump.
Operation:
  I:    target_offset ← sign_extend(offset || 0^2)
        condition ← (GPR[rs] ≠ GPR[rt])
  I+1:  if condition then
            PC ← PC + target_offset
        endif
Exceptions:
  None
Programming Notes:
  With the 18-bit signed instruction offset, the conditional branch range is ±128 KBytes. Use jump (J) or jump register (JR) instructions to branch to addresses outside this range.
Encoding:
  Bits    31..26   25..21   20..16   15..0
  Field   BNE      rs       rt       offset
  Value   000101
  Width   6        5        5        16
1.3. PARC Instruction Set Architecture
Subset of MIPS32 with several important differences
Only little-endian, very simple address translation
No hi/lo registers, only 32 general purpose registers
Multiply and divide instructions target general purpose registers
Only a subset of all MIPS32 instructions
No branch delay slot
PARCv1: Very small subset suitable for examples
PARCv2: Subset suitable for executing simple C programs without system calls (e.g., open, write, read)
PARCv3: Subset suitable for executing all integer single-threaded C programs without system calls
PARCv4: Full PARC ISA
PARCv1 instruction assembly, semantics, and encoding
(Each encoding lists its fields from bit 31 down to bit 0, with the bit range in brackets.)

addu rd, rs, rt
  Semantics: R[rd] ← R[rs] + R[rt]; PC ← PC+4
  Encoding:  000000 [31:26] | rs [25:21] | rt [20:16] | rd [15:11] | 00000 [10:6] | 100001 [5:0]

addiu rt, rs, imm
  Semantics: R[rt] ← R[rs] + sext(imm); PC ← PC+4
  Encoding:  001001 [31:26] | rs [25:21] | rt [20:16] | imm [15:0]

mul rd, rs, rt
  Semantics: R[rd] ← R[rs] * R[rt]; PC ← PC+4
  Encoding:  011100 [31:26] | rs [25:21] | rt [20:16] | rd [15:11] | 00000 [10:6] | 000010 [5:0]

lw rt, imm(rs)
  Semantics: R[rt] ← M[ R[rs] + sext(imm) ]; PC ← PC+4
  Encoding:  100011 [31:26] | rs [25:21] | rt [20:16] | imm [15:0]

sw rt, imm(rs)
  Semantics: M[ R[rs] + sext(imm) ] ← R[rt]; PC ← PC+4
  Encoding:  101011 [31:26] | rs [25:21] | rt [20:16] | imm [15:0]

j target
  Semantics: PC ← jtarg( PC, target )
  Encoding:  000010 [31:26] | target [25:0]

jal target
  Semantics: R[31] ← PC + 4; PC ← jtarg( PC, target )
  Encoding:  000011 [31:26] | target [25:0]

jr rs
  Semantics: PC ← R[rs]
  Encoding:  000000 [31:26] | rs [25:21] | 00000 [20:16] | 00000 [15:11] | 00000 [10:6] | 001000 [5:0]

bne rs, rt, imm
  Semantics: if ( R[rs] != R[rt] ) PC ← PC+4 + imm*4
  Encoding:  000101 [31:26] | rs [25:21] | rt [20:16] | imm [15:0]
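
The PARCv1 encodings can be turned into a small encoder to check bit layouts by hand. This is a sketch under stated assumptions: the helper names (`fields_6_5_5_5_5_6`, `fields_6_5_5_16`, `fields_6_26`, `encode`) are hypothetical, operands are given in assembly order, and `bne` is assumed to carry `rt` in bits 20:16, following the MIPS32 BNE page reproduced earlier in these notes.

```python
def fields_6_5_5_5_5_6(opcode, rs, rt, rd, zeros, func):
    """Pack a word with field widths 6/5/5/5/5/6 (addu, mul, jr)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (zeros << 6) | func


def fields_6_5_5_16(opcode, rs, rt, imm):
    """Pack a word with field widths 6/5/5/16 (addiu, lw, sw, bne)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def fields_6_26(opcode, target):
    """Pack a word with field widths 6/26 (j, jal)."""
    return (opcode << 26) | (target & 0x3FFFFFF)


# Each entry takes its operands in the same order as the assembly syntax.
ENCODERS = {
    "addu":  lambda rd, rs, rt: fields_6_5_5_5_5_6(0b000000, rs, rt, rd, 0, 0b100001),
    "addiu": lambda rt, rs, imm: fields_6_5_5_16(0b001001, rs, rt, imm),
    "mul":   lambda rd, rs, rt: fields_6_5_5_5_5_6(0b011100, rs, rt, rd, 0, 0b000010),
    "lw":    lambda rt, imm, rs: fields_6_5_5_16(0b100011, rs, rt, imm),
    "sw":    lambda rt, imm, rs: fields_6_5_5_16(0b101011, rs, rt, imm),
    "j":     lambda target: fields_6_26(0b000010, target),
    "jal":   lambda target: fields_6_26(0b000011, target),
    "jr":    lambda rs: fields_6_5_5_5_5_6(0b000000, rs, 0, 0, 0, 0b001000),
    "bne":   lambda rs, rt, imm: fields_6_5_5_16(0b000101, rs, rt, imm),
}


def encode(name, *operands):
    """Return the 32-bit encoding of one PARCv1 instruction."""
    return ENCODERS[name](*operands)


addu_word = encode("addu", 14, 12, 13)   # addu r14, r12, r13
lw_word = encode("lw", 12, 0, 4)         # lw r12, 0(r4)
jr_word = encode("jr", 31)               # jr r31
```

Printing the results with `hex()` gives a quick way to compare an encoding against the field layout listed for each instruction.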
PARCv1 vector-vector add assembly and C program
C code for doing element-wise vector addition.
Equivalent PARCv1 assembly code. Recall that arguments are passed in
registers r4-r7, the return value is stored in r2, and the return address is
stored in r31.
PARCv1 mystery assembly and C program
What is the C code corresponding to the PARCv1 assembly shown
below? Assume the assembly implements a function.
addiu r12, r0, 0
loop:
lw r13, 0(r4)
bne r13, r6, foo
addiu r2, r12, 0
jr r31
foo:
addiu r4, r4, 4
addiu r12, r12, 1
bne r12, r5, loop
addiu r2, r0, -1
jr r31
2. Single-Cycle Processors
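[Figure: processor block diagram. The control unit sends control signals to the datapath and receives status signals from it. The processor connects to a combinational memory (<1 cycle) through an instruction request/response interface and a data request/response interface, each with a val signal.]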
2.1. Single-Cycle Processor Datapath
Implementing ADDU Instruction
[Figure: datapath for ADDU. Labels: pc, regfile (read), regfile (write), regfile_wen, regfile_waddr]
Implementing ADDIU Instruction
[Figure: datapath for ADDIU. Labels: pc, regfile (read), regfile (write), regfile_wen, regfile_waddr]
Implementing ADDU and ADDIU Instruction
[Figure: combined datapath for ADDU and ADDIU. Labels: pc, +4, pc_plus4, Instruction Mem Interface, To control unit, regfile (read), regfile (write), regfile_wen, regfile_waddr, ir[25:21], ir[20:16], ir[15:0], sext, op1_sel, alu]
Adding the MUL Instruction
[Figure: datapath after adding MUL. Labels: pc, +4, pc_plus4, Instruction Mem Interface, To control unit, regfile (read), regfile (write), regfile_wen, regfile_waddr, ir[25:21], ir[20:16], ir[15:0], sext, op1_sel, alu, mul, wb_sel]
Adding the LW Instruction
[Figure: datapath after adding LW. Labels: pc, +4, pc_plus4, Instruction Mem Interface, To control unit, regfile (read), regfile (write), regfile_wen, regfile_waddr, ir[25:21], ir[20:16], ir[15:0], sext, op1_sel, alu, mul, wb_sel, Data Mem Interface]
Adding the SW Instruction
[Figure: datapath after adding SW. Labels: pc, +4, pc_plus4, Instruction Mem Interface, To control unit, regfile (read), regfile (write), regfile_wen, regfile_waddr, ir[25:21], ir[20:16], ir[15:0], sext, op1_sel, alu, mul, wb_sel, Data Mem Interface]
Adding the J Instruction
[Figure: datapath after adding J. Labels: pc, +4, pc_plus4, pc_sel, Instruction Mem Interface, To control unit, regfile (read), regfile (write), regfile_wen, regfile_waddr, ir[25:21], ir[20:16], ir[15:0], ir[25:0], sext, op1_sel, alu, mul, wb_sel, Data Mem Interface, j_tgen, j_targ]
Adding the JAL Instruction
[Figure: datapath after adding JAL. Labels: pc, +4, pc_plus4, pc_sel, Instruction Mem Interface, To control unit, regfile (read), regfile (write), regfile_wen, regfile_waddr, ir[25:21], ir[20:16], ir[15:0], ir[25:0], sext, op1_sel, alu, alu_func, mul, wb_sel, Data Mem Interface, j_tgen, j_targ]
Adding the JR Instruction
[Figure: datapath after adding JR. Labels: pc, +4, pc_plus4, pc_sel, Instruction Mem Interface, To control unit, regfile (read), regfile (write), regfile_wen, regfile_waddr, ir[25:21], ir[20:16], ir[15:0], ir[25:0], sext, op1_sel, alu, alu_func, mul, wb_sel, Data Mem Interface, j_tgen, j_targ, jr]
Adding the BNE Instruction
[Figure: datapath after adding BNE. Labels: pc, +4, pc_plus4, pc_sel, Instruction Mem Interface, To control unit, regfile (read), regfile (write), regfile_wen, regfile_waddr, ir[25:21], ir[20:16], ir[15:0], ir[25:0], sext, op1_sel, alu, alu_func, mul, wb_sel, Data Mem Interface, j_tgen, j_targ, jr, br_tgen, br_targ, eq]
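
Once J, JAL, JR, and BNE are in place, the next value of the pc is chosen among four candidates named in the datapath: pc_plus4, j_targ, jr, and br_targ. The Python sketch below models that selection. It rests on assumptions: the string values for `pc_sel` are hypothetical labels rather than the actual mux encoding, `jtarg` is assumed to follow the MIPS convention (upper four bits of PC+4 combined with the 26-bit target shifted left by two), and the branch target uses PC+4 + sext(imm)*4 as in the PARCv1 semantics for bne.

```python
MASK32 = 0xFFFFFFFF


def sext16(value):
    """Sign-extend the low 16 bits of value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def next_pc(pc, ir, pc_sel, rs_value=0):
    """Model the pc_sel mux: pick the next PC from the four candidates."""
    pc_plus4 = (pc + 4) & MASK32
    if pc_sel == "pc_plus4":
        return pc_plus4
    if pc_sel == "j_targ":       # j_tgen, assumed MIPS-style jump target
        return (pc_plus4 & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)
    if pc_sel == "jr":           # jump register: value read from the regfile
        return rs_value & MASK32
    if pc_sel == "br_targ":      # br_tgen: PC+4 + sext(imm)*4
        return (pc_plus4 + (sext16(ir) << 2)) & MASK32
    raise ValueError(f"unknown pc_sel: {pc_sel}")


def bne_pc_sel(eq):
    """Control decision for bne: branch only when the operands differ."""
    return "pc_plus4" if eq else "br_targ"


# bne r7, r0, -9 located at address 0x1000, operands not equal
taken_pc = next_pc(0x1000, 0x14E0FFF7, bne_pc_sel(eq=False))
```

The `eq` status signal travels from the datapath to the control unit, which is why the bne decision lives in a separate function from the mux itself.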
Adding a New Auto-Incrementing Load Instruction
Draw on the datapath diagram what paths we need to use as well as
any new paths we will need to add in order to implement the following
auto-incrementing load instruction.
lw.ai rt, imm(rs)
R[rt] ← M[ R[rs] + sext(imm) ]; R[rs] ← R[rs] + 4
[Figure: full PARCv1 datapath to annotate. Labels: pc, +4, pc_plus4, pc_sel, Instruction Mem Interface, To control unit, regfile (read), regfile (write), regfile_wen, regfile_waddr, ir[25:21], ir[20:16], ir[15:0], ir[25:0], sext, op1_sel, alu, alu_func, mul, wb_sel, Data Mem Interface, j_tgen, j_targ, jr, br_tgen, br_targ, eq]
2.2. Single-Cycle Processor Control Unit
inst pc_sel op1_sel alu_func wb_sel waddr wen
addu
addiu
mul
lw
sw
j
jal
jr
bne
lw.ai
3. Analyzing Processor Performance
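\[
\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}
\]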
Instructions / program depends on source code, compiler, ISA
Cycles / instruction (CPI) depends on ISA, microarchitecture
Time / cycle depends upon microarchitecture and implementation
             Microarchitecture        CPI   Cycle Time
this topic   Single-Cycle Processor   1     long
Topic 02     FSM Processor            >1    short
Topic 03     Pipelined Processor      ≈1    short
Students often confuse Cycle Time with the execution time
of a sequence of transactions measured in cycles.
Cycle Time is the clock period or the inverse of the clock frequency.
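
The first-order performance equation is a product of three terms, which a short Python sketch makes concrete. All numbers below are hypothetical and chosen only to illustrate the trade-off between CPI and cycle time; they are not measurements of any processor in these notes, and the function name `execution_time_ns` is an assumption.

```python
def execution_time_ns(instructions, cpi, cycle_time_ns):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cpi * cycle_time_ns


# Hypothetical values for a 1000-instruction program.
INSTRUCTIONS = 1000
designs = {
    "single-cycle": {"cpi": 1, "cycle_time_ns": 10},   # CPI of 1, long cycle
    "fsm":          {"cpi": 4, "cycle_time_ns": 3},    # CPI > 1, short cycle
}

times = {
    name: execution_time_ns(INSTRUCTIONS, **params)
    for name, params in designs.items()
}
# Total cycles (instructions * CPI) is a count; cycle time is a duration.
cycles = {name: INSTRUCTIONS * params["cpi"] for name, params in designs.items()}
```

Keeping `cycles` and `times` as separate quantities mirrors the distinction drawn above: the number of cycles a program takes is not the same thing as the cycle time.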
Estimating cycle time
There are many paths through the design that start at a state element
and end at a state element. The critical path is the longest of these
paths. We can usually use a simple first-order static timing estimate to
determine the cycle time (i.e., the clock period and thus also the clock
frequency).
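
A first-order static timing estimate can be sketched as summing component delays along each register-to-register path and taking the maximum. In the Python sketch below, every delay value and every path is hypothetical: the component names loosely follow the datapath labels, but the numbers are placeholders, not the delays used in lecture.

```python
# Hypothetical component delays in picoseconds.
DELAY_PS = {
    "pc_read": 100,        # clock-to-output delay of the pc register
    "imem": 2000,          # instruction memory access
    "regfile_read": 1000,
    "mux": 200,
    "alu": 1500,
    "mul": 4000,
    "dmem": 2000,          # data memory access
    "regfile_setup": 300,  # setup time at the regfile write port
}

# Hypothetical paths from the pc register to the regfile write port.
PATHS = {
    "addu": ["pc_read", "imem", "regfile_read", "mux", "alu", "mux", "regfile_setup"],
    "mul":  ["pc_read", "imem", "regfile_read", "mul", "mux", "regfile_setup"],
    "lw":   ["pc_read", "imem", "regfile_read", "mux", "alu", "dmem", "mux",
             "regfile_setup"],
}


def path_delay_ps(path):
    """Sum the delays of the components along one path."""
    return sum(DELAY_PS[component] for component in path)


# The critical path is the longest path; it sets the cycle time.
critical_inst = max(PATHS, key=lambda inst: path_delay_ps(PATHS[inst]))
cycle_time_ps = path_delay_ps(PATHS[critical_inst])
```

In a single-cycle processor every instruction pays for the slowest path, so under these made-up delays even `addu` would run with the cycle time set by `mul`.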
[Figure: full PARCv1 datapath for critical-path analysis. Labels: pc, +4, pc_plus4, pc_sel, Instruction Mem Interface, To control unit, regfile (read), regfile (write), regfile_wen, regfile_waddr, ir[25:21], ir[20:16], ir[15:0], ir[25:0], sext, op1_sel, alu, alu_func, mul, wb_sel, Data Mem Interface, j_tgen, j_targ, jr, br_tgen, br_targ, eq]
Estimating execution time
Using our first-order equation for processor performance, how long in
nanoseconds will it take to execute the vector-vector add example
assuming n is 64?
loop:
lw r12, 0(r4)
lw r13, 0(r5)
addu r14, r12, r13
sw r14, 0(r6)
addiu r4, r4, 4
addiu r5, r5, 4
addiu r6, r6, 4
addiu r7, r7, -1
bne r7, r0, loop
jr r31
Using our first-order equation for processor performance, how long in
nanoseconds will it take to execute the mystery program assuming n is
64 and that we find a match on the 10th element?
addiu r12, r0, 0
loop:
lw r13, 0(r4)
bne r13, r6, foo
addiu r2, r12, 0
jr r31
foo:
addiu r4, r4, 4
addiu r12, r12, 1
bne r12, r5, loop
addiu r2, r0, -1
jr r31
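
One way to check the dynamic instruction counts for both questions is to run the two programs on a tiny interpreter. The Python sketch below is an illustration under stated assumptions: the tuple-based program format, the `run` function, the memory layout, and the register roles for the mystery program (r4 as array base, r5 as n, r6 as the value compared against) are choices made for this example. The cycle time is not stated in the text, so `CYCLE_TIME_NS` is a hypothetical placeholder.

```python
MASK32 = 0xFFFFFFFF


def run(program, labels, regs, mem, max_steps=100_000):
    """Run until the first jr executes; return the dynamic instruction count."""
    pc, count = 0, 0                         # pc indexes the program list
    while count < max_steps:
        op, a, b, c = program[pc]
        pc += 1
        count += 1
        if op == "addiu":                    # addiu rt, rs, imm
            regs[a] = (regs[b] + c) & MASK32
        elif op == "addu":                   # addu rd, rs, rt
            regs[a] = (regs[b] + regs[c]) & MASK32
        elif op == "lw":                     # lw rt, imm(rs)
            regs[a] = mem[(regs[c] + b) & MASK32]
        elif op == "sw":                     # sw rt, imm(rs)
            mem[(regs[c] + b) & MASK32] = regs[a]
        elif op == "bne":                    # bne rs, rt, label
            if regs[a] != regs[b]:
                pc = labels[c]
        elif op == "jr":                     # treated as the function return
            return count
        else:
            raise ValueError(f"unsupported instruction: {op}")
    raise RuntimeError("step limit reached")


VVADD = [
    ("lw", 12, 0, 4), ("lw", 13, 0, 5), ("addu", 14, 12, 13), ("sw", 14, 0, 6),
    ("addiu", 4, 4, 4), ("addiu", 5, 5, 4), ("addiu", 6, 6, 4),
    ("addiu", 7, 7, -1), ("bne", 7, 0, "loop"), ("jr", 31, None, None),
]
MYSTERY = [
    ("addiu", 12, 0, 0),
    ("lw", 13, 0, 4), ("bne", 13, 6, "foo"),           # loop:
    ("addiu", 2, 12, 0), ("jr", 31, None, None),
    ("addiu", 4, 4, 4), ("addiu", 12, 12, 1),          # foo:
    ("bne", 12, 5, "loop"),
    ("addiu", 2, 0, -1), ("jr", 31, None, None),
]

N = 64
CYCLE_TIME_NS = 10   # hypothetical; substitute the estimated cycle time
CPI = 1              # single-cycle processor

# Vector-vector add: two source arrays and one destination array of N words.
regs = [0] * 32
regs[4], regs[5], regs[6], regs[7] = 0x1000, 0x2000, 0x3000, N
mem = {0x1000 + 4 * i: i for i in range(N)}
mem.update({0x2000 + 4 * i: 10 * i for i in range(N)})
vvadd_count = run(VVADD, {"loop": 0}, regs, mem)
vvadd_mem = mem

# Mystery program: the value in r6 matches the 10th element (index 9).
regs = [0] * 32
regs[4], regs[5], regs[6] = 0x1000, N, 109
mem = {0x1000 + 4 * i: 100 + i for i in range(N)}
mystery_count = run(MYSTERY, {"loop": 1, "foo": 5}, regs, mem)
mystery_result = regs[2]

vvadd_time_ns = vvadd_count * CPI * CYCLE_TIME_NS
mystery_time_ns = mystery_count * CPI * CYCLE_TIME_NS
```

Under these assumptions the vector-vector add executes 9 instructions per iteration plus the final `jr` (577 in total), and the mystery program executes 50 instructions; multiplying by a CPI of 1 and the actual cycle time gives the execution time.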