
An Introduction to RISC Processors

We are going to describe how microprocessor manufacturers took a new look at processor architectures in the 1980s and started designing simpler but faster processors. We begin by explaining why chip designers turned their backs on conventional complex instruction set computers (CISCs) such as the 68K and Intel 80x86 families and started producing reduced instruction set computers (RISCs) such as MIPS and the PowerPC. RISC processors have simpler instruction sets than CISC processors (although this is a rather crude distinction between the two families, as we shall soon see). By the mid-1990s many of these so-called RISC processors were considerably more complex than some of the CISCs they replaced. That isn't a paradox. The RISC processor isn't really a cut-down computer architecture; it represents a new approach to architecture design. In fact, the distinction between CISC and RISC is now so blurred that virtually all processors have both RISC and CISC features.

The RISC Revolution


Before we look at the ARM, we describe the history and characteristics of RISC architecture. From the introduction of the microprocessor in the 1970s to the mid-1980s there was an almost unbroken trend towards more and more complex (you might even say Baroque) architectures. Some of these architectures developed rather like a snowball rolling downhill. Each advance in chip fabrication technology allowed designers to add more and more layers to the microprocessor's central core. Intel's 8086 family illustrates this trend particularly well, because Intel took their original 16-bit processor and added more features in each successive generation. This approach to chip design leads to cumbersome architectures and inefficient instruction sets, but it has the tremendous commercial advantage that end users don't have to pay for new software when they buy the latest reincarnation of a microprocessor.

A reaction against the trend toward greater architectural complexity began at IBM with their 801 architecture and continued at Berkeley, where Patterson and Ditzel coined the term RISC to describe a new class of architectures that reversed earlier trends in microcomputer design. According to popular wisdom, RISC architectures are streamlined versions of traditional complex instruction set computers. This notion is both misleading and dangerous, because it implies that RISC processors are in some way cruder versions of existing architectures. In brief, RISC architectures re-deploy to better effect some of the silicon real estate used to implement complex instructions and elaborate addressing modes in conventional microprocessors of the 68000 and 8086 generation. The mnemonic RISC should really stand for regular instruction set computer.

Two factors influencing the architecture of first- and second-generation microprocessors were microprogramming and the desire to help compiler writers by providing ever more complex instruction sets. The latter is called closing the semantic gap (i.e., reducing the difference between high-level and low-level languages). By complex instructions we mean instructions like MOVE 12(A3,D0),D2 and ADD (A6),D3 that carry out multi-step operations in a single machine-level instruction. The instruction MOVE 12(A3,D0),D2 generates an effective address by adding the contents of A3 to the contents of D0 plus the literal 12. The resulting address is used to access the source operand that is loaded into register D2.

Microprogramming achieved its high point in the 1970s, when ferrite core memory had a long access time of 1 µs or more and high-speed semiconductor random access memory was very expensive. Quite naturally, computer designers used the slow main store to hold the complex instructions that made up the machine-level program. These machine-level instructions were interpreted by microcode in the much faster microprogram control store within the CPU. Today, main stores use semiconductor memory with an access time of 50 ns or less, and most of the advantages of microprogramming have evaporated. Indeed, the goal of a RISC architecture is to execute an instruction in a single machine cycle. A corollary of this statement is that complex instructions can't be executed by RISC architectures. Before we look at RISC architectures, we have to describe some of the research that led to the search for better architectures.

Instruction Usage
Computer scientists carried out extensive research over a decade or more, beginning in the late 1970s, into the way in which computers execute programs. Their studies demonstrated that the relative frequency with which different classes of instructions are executed is not uniform: some types of instruction are executed far more frequently than others. Fairclough divided machine-level instructions into eight groups according to type and compiled the statistics shown in Table 1. The mean value of instruction use gives the percentage of times that instructions in each group are executed, averaged over both program types and computer architectures. These figures relate to early 8-bit processors.

Table 1 Instruction usage as a function of instruction type

    Instruction group    Mean value of instruction use (%)
    1                    45.28
    2                    28.73
    3                    10.75
    4                     5.92
    5                     3.91
    6                     2.93
    7                     2.05
    8                     0.40

These eight instruction groups in Table 1 are:

1. Data movement
2. Program flow control (i.e., branch, call, return)
3. Arithmetic
4. Compare
5. Logical
6. Shift
7. Bit manipulation
8. Input/output and miscellaneous

Table 1 convincingly demonstrates that the most common instruction type is the data movement primitive of the form P := Q in a high-level language or MOVE P,Q in a low-level language. Similarly, the program flow control group, which includes both conditional and unconditional branches (together with subroutine calls and returns), forms the second most common group of instructions. Taken together, the data movement and program flow control groups account for 74% of all instructions. A corollary of this statement is that we can expect a large program to contain only 26% of instructions that are not data movement or program flow control primitives. An inescapable inference from such results is that processor designers might be better employed devoting their time to optimizing the way in which machines handle instructions in groups one and two than in seeking new powerful instructions that are seldom used. In the early days of the microprocessor, chip manufacturers went out of their way to provide special instructions that were unique to their products. These instructions were then heavily promoted by the company's sales force. Today, we can see that their efforts should have been directed towards the goal of optimizing the most frequently used instructions. RISC architectures have been designed to exploit the programming environment in which most instructions are data movement or program control instructions.

Another aspect of computer architecture that was investigated was the optimum size of literal operands (i.e., constants). Tanenbaum reported the remarkable result that 56% of all constant values lie in the range -15 to +15 and that 98% of all constants lie in the range -511 to +511. Consequently, the inclusion of a 5-bit constant field in an instruction would cover over half the occurrences of a literal. RISC architectures have sufficiently long instructions to include a literal field that caters for the majority of literals.

Programs use subroutines heavily, and an effective architecture should optimize the way in which subroutines are called, parameters are passed to and from subroutines, and workspace is allocated to the local variables created by subroutines. Research showed that in 95% of cases twelve words of storage are sufficient for parameter passing and local storage. A computer with twelve registers should therefore be able to handle the operands required by most subroutines without accessing main store. Such an arrangement would reduce the processor-memory bus traffic associated with subroutine calls.

Characteristics of RISC Architectures


Having described the ingredients that go into an efficient architecture, we now look at the attributes of first-generation RISCs before covering RISC architectures in more detail. The characteristics of an efficient RISC architecture are:

1. RISC processors have sufficient on-chip registers to overcome the worst effects of the processor-memory bottleneck. Registers can be accessed more rapidly than off-chip main store. Although today's processors rely heavily on fast on-chip cache memory to increase throughput, registers still offer the highest performance.
2. RISC processors have three-address, register-to-register architectures with instructions of the form OPERATION Ra,Rb,Rc, where Ra, Rb, and Rc are general-purpose registers.
3. Because subroutine calls are so frequently executed, (some) RISC architectures make provision for the efficient passing of parameters between subroutines.
4. Instructions that modify the flow of control (e.g., branch instructions) are implemented efficiently because they comprise about 20 to 30% of a typical program.
5. RISC processors aim to execute one instruction per clock cycle. This goal imposes a limit on the maximum complexity of instructions.
6. RISC processors don't attempt to implement infrequently used instructions. Complex instructions waste silicon real estate and conflict with the requirement of point 5 that an instruction be executed in a single cycle. Moreover, the inclusion of complex instructions increases the time taken to design, fabricate, and test a processor.

A corollary of point 5 is that an efficient architecture should not be microprogrammed, because microprogramming interprets a machine-level instruction by executing microinstructions. In the limit, a RISC processor is close to a microprogrammed architecture in which the distinction between machine code and microcode has vanished.

An efficient processor should have a single instruction format (or at least very few formats). A typical CISC processor such as the 68000 has variable-length instructions (e.g., from 2 to 10 bytes). By providing a single instruction format, the decoding of a RISC instruction into its component fields can be performed with a minimum of decoding logic. It follows that a RISC's instruction length should be sufficient to accommodate the operation code field and one or more operand fields. Consequently, a RISC processor may not utilize memory space as efficiently as a conventional CISC microprocessor.

Two fundamental aspects of the RISC architecture that we cover later are its register set and the use of pipelining. Multiple overlapping register windows were implemented by the Berkeley RISC to reduce the overhead incurred by transferring parameters between subroutines. Pipelining is a mechanism that permits the overlapping of instruction execution (i.e., internal operations are carried out in parallel). Many of the features of RISC processors are not new; they had been employed long before the advent of the microprocessor. The RISC revolution happened when all these performance-enhancing techniques were brought together and applied to microprocessor design.

The Berkeley RISC


Although many CISC processors were designed by semiconductor manufacturers, one of the first RISC processors came from the University of California at Berkeley. The Berkeley RISC wasn't a commercial machine, although it had a tremendous impact on the development of later RISC architectures. Figure 1 describes the format of a Berkeley RISC instruction. Each of the 5-bit operand fields (Destination, Source 1, Source 2) permits one of 32 internal registers to be accessed.

Figure 1 Format of the Berkeley RISC instruction

The single-bit set condition code field, Scc, determines whether the condition code bits are updated after the execution of an instruction. The 14-bit Source 2 field has two functions. If the IM (immediate) bit is 0, the Source 2 field specifies one of 32 registers. If the IM bit is 1, the Source 2 field provides a 13-bit literal operand. Since five bits are allocated to each operand field, it follows that this RISC has 2^5 = 32 internal registers. This last statement is emphatically not true, since the Berkeley RISC has 138 user-accessible general-purpose internal registers. The discrepancy between the number of directly addressable registers and the actual number of registers is due to a mechanism called windowing, which gives the programmer a view of only a subset of all registers at any instant. Register R0 is hardwired to contain the constant zero, so specifying R0 as an operand is the same as specifying the constant 0.
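The field extraction implied by Figure 1 is easy to express programmatically. The following sketch (in Python) shows how a short-format instruction word is split with shifts and masks; the destination, Source 1, IM, and Source 2 positions follow the description in the text, while the exact positions of the op-code and Scc bits are our assumption, since Figure 1 is not reproduced here.

    def decode_short_format(word):
        # Assumed layout: op-code bits 31-25, Scc bit 24, destination bits 23-19,
        # source 1 bits 18-14, IM bit 13, source 2 bits 12-0.
        opcode = (word >> 25) & 0x7F   # 7-bit operation code (assumed width)
        scc    = (word >> 24) & 1      # update the condition codes?
        dest   = (word >> 19) & 0x1F   # one of the 32 visible registers
        src1   = (word >> 14) & 0x1F   # one of the 32 visible registers
        im     = (word >> 13) & 1      # literal or register for source 2?
        s2     = word & 0x1FFF         # 13-bit source 2 field
        if im:
            if s2 & 0x1000:            # sign-extend the 13-bit literal
                s2 -= 0x2000
            return opcode, scc, dest, src1, ('literal', s2)
        return opcode, scc, dest, src1, ('register', s2 & 0x1F)

Note that the same shift-and-mask network serves every instruction, which is exactly why a single instruction format needs so little decoding logic.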

Register Windows
An important feature of the Berkeley RISC architecture is the way in which it allocates new registers to subroutines; that is, when you call a subroutine, you get some new registers. If you can create 12 registers out of thin air when you call a subroutine, each subroutine will have its own workspace for temporary variables, thereby avoiding relatively slow accesses to main store.

Although only 12 or so registers are required by each invocation of a subroutine, the successive nesting of subroutines rapidly increases the total number of on-chip registers assigned to subroutines. You might think that any attempt to dedicate a set of registers to each new procedure is impractical, because the repeated calling of nested subroutines would require an unlimited amount of storage. Subroutines can indeed be nested to any depth, but research has demonstrated that, on average, subroutines are not nested to any great depth over short periods. Consequently, it is feasible to provide a modest number of local register sets for a sequence of nested subroutines. Figure 2 provides a graphical representation of the execution of a typical program in terms of the depth of nesting of subroutines as a function of time. The trace goes up each time a subroutine is called and down each time a return is made. If subroutines were never called, the trace would be a horizontal line. This figure demonstrates that even though subroutines may be nested to considerable depths, there are periods or runs of subroutine calls and returns that do not require a nesting level greater than about five.

Figure 2 Depth of subroutine nesting as a function of time

The mechanism for implementing local variable workspace for subroutines adopted by the designers of the Berkeley RISC is to support up to eight nested subroutines by providing on-chip workspace for each subroutine. Any further nesting forces the CPU to dump registers to main memory, as we shall soon see. Memory space used by subroutines can be divided into four types:

Global space. Global space is directly accessible by all subroutines and holds constants and data that may be required from any point within the program. Most conventional microprocessors have only global registers.

Local space. Local space is private to the subroutine. That is, no other subroutine can access the current subroutine's local address space from outside the subroutine. Local space is employed as working space by the current subroutine.

Imported parameter space. Imported parameter space holds the parameters imported by the current subroutine from the parent that called it. In Berkeley RISC terminology these are called the high registers.

Exported parameter space. Exported parameter space holds the parameters exported by the current subroutine to its child. In RISC terminology these are called the low registers.

Windows and Parameter Passing


One of the reasons for the high frequency of data movement operations is the need to pass parameters to subroutines and to receive results from them. The Berkeley RISC architecture deals with parameter passing by means of multiple overlapped windows. A window is the set of registers visible to the current subroutine. Figure 3 illustrates the structure of the Berkeley RISC's overlapping windows. Only three consecutive windows (i-1, i, i+1) of the 8 windows are shown in Figure 3. The vertical columns represent the registers seen by the corresponding window. Each window sees 32 registers, but they aren't all the same 32 registers. The Berkeley RISC has a special-purpose register called the window pointer, WP, that indicates the current active window. Suppose that the processor is currently using the ith window set. In this case the WP contains the value i. The registers in each of the 8 windows are divided into the four groups shown in Table 2.

Table 2 Berkeley RISC register types

    Register name    Register type
    R0 to R9         Global registers that are accessible from every window.
    R10 to R15       Six low registers used by the subroutine to pass parameters to (and receive results from) its child (i.e., a subroutine called by itself).
    R16 to R25       Ten local registers accessed only by the current subroutine; they cannot be accessed by any other subroutine.
    R26 to R31       Six high registers used by the subroutine to receive parameters from (and return results to) its parent (i.e., the subroutine that called it).

All windows consist of 32 addressable registers, R0 to R31. A Berkeley RISC instruction of the form ADD R3,R12,R25 implements [R25] ← [R3] + [R12], where R3 lies within the window's global space, R12 lies within its exported parameter space (registers shared with a child subroutine), and R25 lies within its local space. RISC arithmetic and logical instructions always involve 32-bit values (there are no 8-bit or 16-bit operations). The Berkeley RISC's subroutine call, CALLR Rd,<address>, is similar to a typical CISC instruction BSR <address>. Whenever a subroutine is invoked by CALLR Rd,<address>, the contents of the window pointer are incremented by 1 and the current value of the program counter is saved in register Rd of the new window. The Berkeley RISC doesn't employ a conventional stack in external main memory to save subroutine return addresses.

Figure 3 Berkeley windowed register sets

Once a new window has been invoked (in Figure 3 this is window i), the new subroutine sees a different set of registers from the previous window. Global registers R0 to R9 are an exception because they are common to all windows. Register R26 of the child (i.e., called) subroutine corresponds to (i.e., is the same as) register R10 of the calling (i.e., parent) subroutine. Suppose you wish to send a parameter to a subroutine. If the parameter is in R10 and you call a subroutine, register R26 in the called subroutine will contain the parameter. There hasn't been a physical transfer of data, because register R26 in the current window is simply register R10 in the previous window.

Figure 4 Relationship between register number, window number, and register address

The physical arrangement of the Berkeley RISC's window system is given in Figure 4. On the left-hand side of the diagram is the actual register array that holds all the on-chip general-purpose registers. The eight columns associated with windows 0 to 7 demonstrate how each window is mapped onto the physical register array on the chip and how the overlapping regions are organized. The windows are logically arranged in a circular fashion, so that window 0 follows window 7 and window 7 precedes window 0. For example, if the current window pointer is 3 and you access register R25, location 74 is accessed in the register file. However, if you access register R25 when the window pointer is 7, you access location 137. The total number of physical registers required to implement the Berkeley windowed register set is: 10 global + 8 × 10 local + 8 × 6 parameter transfer registers = 138 registers.

Window Overflow

Unfortunately, the total quantity of on-chip resources of any processor is finite and, in the case of the Berkeley RISC, the registers are limited to 8 windows. If subroutines are nested to a depth greater than or equal to 7, window overflow is said to occur, as there is no longer a new window for the next subroutine invocation. When an overflow takes place, the only thing left to do is to employ external memory to hold the overflow data. In practice the oldest window is saved rather than the new window created by the subroutine just called. Window underflow is the converse of window overflow: if, following an overflow, the number of subroutine returns minus the number of subroutine calls exceeds 8, window underflow takes place and the youngest window saved in main store must be restored to a register window. A considerable amount of research was carried out into dealing with window overflow efficiently. However, the imaginative use of windowed register sets in the Berkeley RISC was not adopted by many of the later RISC architectures. Modern RISC processors generally have a single set of 32 general-purpose registers.
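Returning to the circular mapping of Figure 4: it can be captured in a few lines of code. The Python sketch below assumes one particular physical numbering (globals at locations 0 to 9, sixteen fresh locations per window, addresses wrapping modulo 8 × 16 = 128); the numbering in Figure 4 may differ by a small offset, but the overlap rule, in which a window's outgoing parameter registers coincide with the next window's incoming ones, is the same.

    def physical_register(window, reg):
        # Map a (window, register) pair to a physical register location.
        # Assumed layout: R0-R9 global at locations 0-9; each window adds 16
        # locations (10 local + 6 parameter), wrapping modulo 8*16 = 128.
        if reg < 10:                   # globals are common to every window
            return reg
        if 26 <= reg <= 31:            # high: parameters shared with the parent
            offset = reg - 26
        elif 16 <= reg <= 25:          # local: private to this window
            offset = 6 + (reg - 16)
        else:                          # R10-R15, low: shared with the child
            offset = 16 + (reg - 10)
        return 10 + (16 * window + offset) % 128

    # The overlap: R10 of window i is the same physical register as R26 of
    # window i+1 (the child window created by incrementing WP on a call).
    assert physical_register(3, 10) == physical_register(4, 26)
    assert physical_register(7, 25) == 137   # matches the example in the text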

RISC Architecture and Pipelining


We now describe pipelining, one of the most important techniques for increasing the throughput of a digital system. Pipelining exploits the regular structure of a RISC to carry out internal operations in parallel.

Figure 5 illustrates the machine cycle of a hypothetical microprocessor executing an ADD P instruction (i.e., [A] ← [A] + [M(P)], where A is an on-chip general-purpose register and P is a memory location). The instruction is executed in five phases:

Instruction fetch. Read the instruction from the system memory and increment the program counter.

Instruction decode. Decode the instruction read from memory during the previous phase. The nature of the instruction decode phase depends on the complexity of the instruction encoding. A regularly encoded instruction might be decoded in a few nanoseconds with two levels of gating, whereas a complex instruction format might require ROM-based look-up tables to implement the decoding.

Operand fetch. The operand specified by the instruction is read from the system memory or an on-chip register and loaded into the CPU.

Execute. The operation specified by the instruction is carried out.

Operand store. The result obtained during the execution phase is written into the operand destination. This may be an on-chip register or a location in external memory.

Figure 5 Instruction Execution

Each of these five phases may take a specific time (although the time taken would normally be an integer multiple of the system's master clock period). Some instructions require fewer than five phases; for example, CMP R1,R2 compares R1 and R2 by subtracting R1 from R2 to set the condition codes, and does not need an operand store phase. The inefficiency in the arrangement of Figure 5 is immediately apparent. Consider the execution phase of instruction interpretation. This phase might take one fifth of an instruction cycle, leaving the instruction execution unit idle for the remaining 80% of the time. The same applies to the other functional units of the processor, which also lie idle for 80% of the time. A technique called instruction pipelining can be employed to increase the effective speed of the processor by overlapping in time the various stages in the execution of an instruction. In the simplest of terms, a pipelined processor executes instruction i while fetching instruction i + 1 at the same time.

The way in which a RISC processor implements pipelining is described in Figure 6. The RISC processor executes an instruction in four steps or phases: instruction fetch from external memory, operand fetch, execute, and operand store (we're using a 4-stage system because a separate instruction decode phase isn't normally necessary). The internal phases take approximately the same time as the instruction fetch, because these operations take place within the CPU itself and operands are fetched from and stored in the CPU's own register file. Instruction 1 in Figure 6 begins in time slot 1 and is completed at the end of time slot 4.

Figure 6 Pipelining and instruction overlap

In a non-pipelined processor, the next instruction doesn't begin until the current instruction has been completed. In the pipelined system of Figure 6, the instruction fetch phase of instruction 2 begins in time slot 2, at the same time that the operand is being fetched for instruction 1. In time slot 3, different phases of instructions 1, 2, and 3 are being executed simultaneously. In time slot 4, all functional units of the system are operating in parallel and an instruction is completed in every time slot thereafter. An n-stage pipeline can increase throughput by up to a factor of n.
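A back-of-the-envelope model makes this limit concrete. The short Python sketch below compares the cycle counts of a non-pipelined and an ideally pipelined machine (no bubbles, one stage per cycle); the speedup approaches, but never exceeds, the number of stages.

    def cycles(n_instructions, n_stages, pipelined):
        # Ideal machine: no bubbles, every stage takes one clock cycle.
        if pipelined:
            # the first instruction takes n_stages cycles; each later one adds 1
            return n_stages + (n_instructions - 1)
        return n_stages * n_instructions

    for n in (4, 100, 10000):
        speedup = cycles(n, 4, False) / cycles(n, 4, True)
        print(n, round(speedup, 2))   # prints 2.29, 3.88, 4.0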

Pipeline Bubbles
A pipeline is an ordered structure that thrives on regularity. At any stage in the execution of a program, a pipeline contains components of two or more instructions at varying stages in their execution. Consider Figure 7, in which a sequence of instructions is being executed in a 4-stage pipelined processor. When the processor encounters a branch instruction, the following instruction is no longer found at the next sequential address but at the target address specified by the branch instruction. The processor is forced to reload its program counter with the value provided by the branch instruction. This means that all the useful work performed by the pipeline must now be thrown away, since the instructions immediately following the branch are not going to be executed. When information in a pipeline is rejected or the pipeline is held up by the introduction of idle states, we say that a bubble has been introduced.

Figure 7 The pipeline bubble caused by a branch

As we have already stated, program control instructions are very frequent. Consequently, any realistic processor using pipelining must do something to overcome the problem of bubbles caused by instructions that modify the flow of control (branch, subroutine call and return). The Berkeley RISC reduces the effect of bubbles by refusing to throw away the instruction following a branch. This mechanism is called a delayed jump or a branch-and-execute technique because the instruction immediately after a branch is always executed. Consider the effect of the following sequence of instructions:
    ADD  R1,R2,R3    [R3] ← [R1] + [R2]
    JMPX N           [PC] ← N              Goto address N
    ADD  R2,R4,R5    [R5] ← [R2] + [R4]    This instruction is executed
    ADD  R7,R8,R9                          Not executed because the branch is taken

The processor calculates R5 := R2 + R4 before executing the branch. This sequence of instructions is most strange to the eyes of a conventional assembly language programmer, who is not accustomed to seeing an instruction executed after a branch has been taken.

Unfortunately, it's not always possible to arrange a program in such a way as to include a useful instruction immediately after a branch. Whenever this happens, the compiler must introduce a no-operation instruction, NOP, after the branch and accept the inevitability of a bubble. Figure 8 demonstrates how a RISC processor implements a delayed jump. The branch described in Figure 8 is a computed branch whose target address is calculated during the execute phase of the instruction cycle.

Figure 8 Delayed branch
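Filling the delay slot is the compiler's job. The following Python sketch shows the idea in miniature (the tuple-based instruction representation and the single dependence test are our simplifications; a real compiler must also consider condition codes and memory dependencies): move the instruction ahead of the branch into the slot if it is independent of the branch, otherwise fall back to a NOP.

    def fill_delay_slot(block, branch):
        # block: list of (op, dest, sources) tuples preceding the branch.
        # Move the last instruction into the branch delay slot if it does
        # not feed the branch condition; otherwise insert a NOP (a bubble).
        if block:
            op, dest, sources = block[-1]
            if dest not in branch[2]:
                return block[:-1] + [branch, block[-1]]
        return block + [branch, ('NOP', None, ())]

    before = [('ADD', 'R3', ('R1', 'R2')), ('ADD', 'R5', ('R2', 'R4'))]
    print(fill_delay_slot(before, ('JMPX', None, ())))
    # ADD R5 lands after JMPX, mirroring the delayed-branch sequence above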

Another problem caused by pipelining is data dependency, in which certain sequences of instructions run into trouble because the current operation requires a result from the previous operation and the previous operation has not yet left the pipeline. Figure 9 demonstrates how data dependency occurs.

Figure 9 Data dependency

Suppose a programmer wishes to carry out the apparently harmless calculation X := (A + B) AND (A + B - C). Assuming that A, B, C, X, and two temporary values, T1 and T2, are in registers in the current window, we can write:

    ADD A,B,T1     [T1] ← [A] + [B]
    SUB T1,C,T2    [T2] ← [T1] - [C]
    AND T1,T2,X    [X] ← [T1] AND [T2]

Instruction i + 1 in Figure 9 begins execution during the operand fetch phase of the previous instruction. However, instruction i + 1 cannot continue on to its operand fetch phase, because the very operand it requires does not get written back to the register file for another two clock cycles. Consequently, a bubble must be introduced into the pipeline while instruction i + 1 waits for its data. In a similar fashion, the logical AND operation also introduces a bubble, as it too requires the result of a previous operation that is still in the pipeline. Figure 10 demonstrates a technique called internal forwarding designed to overcome the effects of data dependency. The following sequence of operations is to be executed:

    1. ADD R1,R2,R3    [R3] ← [R1] + [R2]
    2. ADD R4,R5,R6    [R6] ← [R4] + [R5]
    3. ADD R3,R4,R7    [R7] ← [R3] + [R4]
    4. ADD R7,R1,R8    [R8] ← [R7] + [R1]

Figure 10 Internal forwarding

In this example, instruction 3 (i.e., ADD R3,R4,R7) uses an operand generated by instruction 1 (i.e., the contents of register R3). Because of the intervening instruction 2, the destination operand generated by instruction 1 has time to be written into the register file before it is read as a source operand by instruction 3. Instruction 3 generates a destination operand R7 that is required as a source operand by the next instruction. If the processor were to read the source operand requested by instruction 4 from the register file, it would see the old value of R7. By means of internal forwarding the processor transfers R7 from instruction 3's execution unit directly to the execution unit of instruction 4 (see Figure 10).
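The forwarding decision itself is only a comparison of register numbers. The sketch below (Python; a deliberately simplified model of the 4-stage pipeline described above, in which only the immediately preceding instruction is close enough to need forwarding) reports which source operands must be taken from the execution unit rather than from the register file.

    def forwarding_plan(program):
        # program: list of (dest, (src1, src2)) register-number pairs.
        # A source must be forwarded if it is the destination of the
        # instruction immediately before it; in this simplified 4-stage
        # model, older results have already reached the register file.
        plan = []
        for i, (dest, sources) in enumerate(program):
            prev_dest = program[i - 1][0] if i > 0 else None
            plan.append([s for s in sources if s == prev_dest])
        return plan

    prog = [(3, (1, 2)),   # 1. ADD R1,R2,R3
            (6, (4, 5)),   # 2. ADD R4,R5,R6
            (7, (3, 4)),   # 3. ADD R3,R4,R7  (R3 already written back)
            (8, (7, 1))]   # 4. ADD R7,R1,R8  (R7 must be forwarded)
    print(forwarding_plan(prog))   # [[], [], [], [7]]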

Accessing External Memory in RISC Systems


Conventional CISC processors have a wealth of addressing modes that are used in conjunction with memory reference instructions. For example, the 68020 implements ADD D0,-(A5), which adds the contents of D0 to the element at the top of the stack pointed at by A5 and pushes the result onto the stack. In their ruthless pursuit of efficiency, the designers of the Berkeley RISC severely restricted the way in which it accesses external memory. The Berkeley RISC permits only two types of reference to external memory: a load and a store. All arithmetic and logical operations carried out by the RISC apply only to source and destination operands in registers. Similarly, the Berkeley RISC provides a limited number of addressing modes with which to access an operand in the main store. It's not hard to find the reason for these restrictions on external memory accesses: an external memory reference takes longer than an internal operation.

We now discuss some of the general principles of Berkeley RISC load and store instructions. Consider the load register operation of the form LDXW (Rx)S2,Rd, which has the effect [Rd] ← [M([Rx] + S2)]. The operand address is the contents of register Rx plus the offset S2. Figure 11 demonstrates the sequence of actions performed during the execution of this instruction. During the source fetch phase, register Rx is read from the register file and used to calculate the effective address of the operand in the execute phase. However, the processor can't progress beyond the execute phase to the store operand phase, because the operand hasn't yet been read from the main store. Therefore the main store must be accessed to read the operand, and a store operand phase executed to load the operand into destination register Rd. Because memory accesses introduce bubbles into the pipeline, they are avoided wherever possible.

Figure 11 The load operation

The Berkeley RISC implements two basic addressing modes: indexed and program counter relative. All other addressing modes can (and must) be synthesized from these two primitives. The effective address in the indexed mode is given by:

    EA = [Rx] + S2

where Rx is the index register (one of the 32 general-purpose registers accessible by the current subroutine) and S2 is an offset. The offset can be either a general-purpose register or a 13-bit constant. The effective address in the program counter relative mode is given by:

    EA = [PC] + S2

where PC represents the contents of the program counter and S2 is an offset as above. These addressing modes provide quite a powerful toolbox: zero, one, or two pointers and a constant offset. If you wonder how we can use an addressing mode without an index (i.e., pointer) register, remember that R0 in the global register set permanently contains the constant 0. For example, LDXW (R12)R0,R3 uses simple register indirect addressing, whereas LDXW (R0)123,R3 uses absolute addressing (i.e., memory location 123).

There's a difference between the addressing modes permitted by load and store operations. A load instruction permits the second source, S2, to be either an immediate value or a second register, whereas a store instruction permits S2 to be a 13-bit immediate value only. This lack of symmetry between the load and store addressing modes exists because a "load base+index" instruction requires a register file with two ports, whereas a "store base+index" instruction requires a register file with three ports. Two-ported memory allows two simultaneous accesses; three-ported memory allows three simultaneous accesses and is harder to design.

Figure 1 defines just two basic Berkeley RISC instruction formats. The short immediate format provides a 5-bit destination, a 5-bit source 1 operand, and a 14-bit short source 2 operand. The short immediate format has two variations: one that specifies a 13-bit literal for source 2 and one that specifies a 5-bit source 2 register address. Bit 13 specifies whether the source 2 operand is a 13-bit literal or a 5-bit register pointer. The long immediate format provides a 19-bit source operand by concatenating the two source operand fields. Thirteen-bit and 19-bit immediate fields may sound a little strange at first sight. However, since 13 + 19 = 32, the Berkeley RISC permits a full 32-bit value to be loaded into a window register in two operations. In the next section we will discover that the ARM processor deals with literals in a different way. A typical CISC microprocessor might take the same number of instruction bits to perform the same action (i.e., a 32-bit operation code field followed by a 32-bit literal).

The following addressing modes can be synthesized from the RISC's basic addressing modes (a short sketch of these syntheses follows below):

1. Absolute addressing: EA = 13-bit offset. Implemented by setting Rx = R0 = 0 and S2 = 13-bit constant.
2. Register indirect: EA = [Rx]. Implemented by setting S2 = R0 = 0.
3. Indexed addressing: EA = [Rx] + offset. Implemented by setting S2 = 13-bit constant.
4. Two-dimensional byte addressing (i.e., byte array access): EA = [Rx] + [Ry]. Implemented by setting S2 = [Ry]. This mode is available only for load instructions.

Conditional instructions (i.e., branch operations) do not require a destination address, so the five bits 19 to 23 normally used to specify a destination register are used to specify the condition (one of 16, since bit 23 is not used by conditional instructions).
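Here is that sketch: a minimal Python model of the effective-address rule underlying all four syntheses (the register-file representation and the S2 encoding are our simplifications).

    def effective_address(regs, rx, s2):
        # EA = [Rx] + S2, where register 0 is hardwired to zero and s2 is
        # either ('reg', n) or ('imm', value) for a 13-bit constant.
        regs = {**regs, 0: 0}                         # R0 always reads as 0
        offset = regs[s2[1]] if s2[0] == 'reg' else s2[1]
        return regs[rx] + offset

    regs = {12: 0x1000, 9: 8}
    print(effective_address(regs, 0,  ('imm', 123)))  # absolute: 123
    print(effective_address(regs, 12, ('reg', 0)))    # register indirect: 0x1000
    print(effective_address(regs, 12, ('imm', 16)))   # indexed: 0x1000 + 16
    print(effective_address(regs, 12, ('reg', 9)))    # base + index: 0x1000 + 8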

Reducing the Branch Penalty


If we're going to reduce the effect of branches on the performance of RISC processors, we need to determine the effect of branch instructions on the performance of the system. Because we cannot know how many branches a given program will contain, or how likely each branch is to be taken, we have to construct a probabilistic model to describe the system's performance. We make the following assumptions:

1. Each non-branch instruction is executed in one cycle.
2. The probability that a given instruction is a branch is pb.
3. The probability that a branch instruction will be taken is pt.
4. If a branch is taken, the additional penalty is b cycles.
5. If a branch is not taken, there is no penalty.
6. If pb is the probability that an instruction is a branch, 1 - pb is the probability that it is not a branch.

The average number of cycles per instruction is the sum of the cycles taken by non-branch instructions, plus the cycles taken by branch instructions that are taken, plus the cycles taken by branch instructions that are not taken. We can therefore write:

    Tave = (1 - pb)·1 + pb·pt·(1 + b) + pb·(1 - pt)·1 = 1 + pb·pt·b

This expression, 1 + pb·pt·b, tells us that the number of branch instructions, the probability that a branch is taken, and the overhead per branch instruction all contribute to the branch penalty. We are now going to examine some of the ways in which the value of pb·pt·b can be reduced.
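Plugging representative numbers into this formula shows how quickly branches erode performance. A minimal check in Python (the probabilities below are illustrative, not measured):

    def average_cycles(pb, pt, b):
        # Tave = 1 + pb*pt*b: one cycle per instruction plus the branch penalty
        return 1 + pb * pt * b

    # e.g. 20% branches, 60% of them taken, 3-cycle penalty per taken branch
    print(average_cycles(0.20, 0.60, 3))   # 1.36 cycles per instruction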

Branch Prediction
If we can predict the outcome of a branch instruction before it is executed, we can start filling the pipeline with instructions from the branch target address (assuming the branch is going to be taken). For example, if the instruction is BRA N, the processor can start fetching instructions at locations N, N + 1, N + 2, and so on, as soon as the branch instruction is fetched from memory. In this way, the pipeline is always filled with useful instructions. This prediction mechanism works well with an unconditional branch like BRA N. Unfortunately, conditional branches pose a problem. Consider a conditional branch of the form BCC N (branch to N on carry bit clear). Should the RISC processor assume that the branch will not be taken and fetch instructions in sequence, or should it assume that the branch will be taken and fetch instructions at the branch target address N? As we have already said, conditional branches are required to implement various types of high-level language construct. Consider the following fragment of high-level language code:

    if (J < K) I = I + L;
    for (T = 1; T <= I; T++)
    {
        .
        .
    }

The first conditional operation compares J with K. Only the nature of the problem will tell us whether J is often less than K. The second conditional in this fragment of code is provided by the for construct, which tests a counter at the end of the loop and then decides whether to jump back to the body of the construct or to terminate the loop. In this case, you could bet that the loop is more likely to be repeated than exited; loops can be executed thousands of times before they are exited. Some computers look at the type of conditional branch and then either fill the pipeline from the branch target if it is thought that the branch will be taken, or fill the pipeline from the instruction after the branch if it is thought that it will not be taken.

If we attempt to predict the behavior of a system with two outcomes (branch taken or branch not taken), there are four possibilities:

1. Predict branch taken and branch taken: successful outcome.
2. Predict branch taken and branch not taken: unsuccessful outcome.
3. Predict branch not taken and branch not taken: successful outcome.
4. Predict branch not taken and branch taken: unsuccessful outcome.

Suppose we apply a branch penalty to each of these four possible outcomes. The penalty is the number of cycles taken by that particular outcome, as Table 3 demonstrates. For example, if we predict that a branch will not be taken and fetch the instructions following the branch, but the branch is actually taken (forcing the pipeline to be loaded with instructions at the target address), the branch penalty in Table 3 is c cycles.

Table 3 The branch penalty

    Prediction          Result              Branch penalty
    Branch taken        Branch taken        a
    Branch taken        Branch not taken    b
    Branch not taken    Branch taken        c
    Branch not taken    Branch not taken    d

We can now calculate the average penalty for a particular system. To do this we need more information about the system. The first thing we need to know is the probability that an instruction will be a branch (as opposed to any other category of instruction); call this probability pb. The next thing we need to know is the probability that the branch instruction will be taken, pt. Finally, we need to know the accuracy of the prediction: let pc be the probability that a branch prediction is correct. These values can be obtained by observing the performance of real programs. Figure 12 illustrates all the possible outcomes of an instruction. We can immediately write:

    (1 - pb) = probability that an instruction is not a branch
    (1 - pt) = probability that a branch will not be taken
    (1 - pc) = probability that a prediction is incorrect

These expressions follow from the principle that if one event or the other must take place, their probabilities must add up to unity. The average branch penalty per branch instruction is therefore:

    Cave = a·(p branch predicted taken and taken)
         + b·(p branch predicted taken but not taken)
         + c·(p branch predicted not taken but taken)
         + d·(p branch predicted not taken and not taken)

    Cave = a·pt·pc + b·(1 - pt)·(1 - pc) + c·pt·(1 - pc) + d·(1 - pt)·pc

Figure 12 Branch prediction

The average number of cycles added per instruction by branch instructions is Cave·pb:

    Cave·pb = pb·(a·pt·pc + b·(1 - pt)·(1 - pc) + c·pt·(1 - pc) + d·(1 - pt)·pc)

We can make two assumptions to simplify this general expression. The first is that a = d = N (i.e., if the prediction is correct, the number of cycles is N). The second is that b = c = B (i.e., if the prediction is wrong, the number of cycles is B). The average number of cycles per branch instruction is therefore:

    pb·(N·pt·pc + B·pt·(1 - pc) + B·(1 - pt)·(1 - pc) + N·(1 - pt)·pc) = pb·(N·pc + B·(1 - pc))

This formula can be used to investigate tradeoffs between branch penalties, branch probabilities, and pipeline length.
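The simplification is easy to verify numerically. The sketch below evaluates both the four-term expression and the reduced form for arbitrary probabilities (the sample values are illustrative only).

    def branch_cycles(pb, pt, pc, N, B):
        # Average cycles added per instruction by branches, before and after
        # the a = d = N, b = c = B simplification; the two must agree.
        full = pb * (N*pt*pc + B*(1 - pt)*(1 - pc) + B*pt*(1 - pc) + N*(1 - pt)*pc)
        reduced = pb * (N*pc + B*(1 - pc))
        assert abs(full - reduced) < 1e-12
        return reduced

    print(branch_cycles(pb=0.2, pt=0.6, pc=0.9, N=1, B=4))   # 0.26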

There are several ways of implementing branch prediction (i.e., increasing the value of pc). Two basic approaches are static branch prediction and dynamic branch prediction.

Static branch prediction assumes that branches are always taken or never taken. Since observations of real code have demonstrated that branches have a greater than 50% chance of being taken, the better static prediction mechanism is to fetch the next instruction from the branch target address as soon as the branch instruction is detected. A better method of predicting the outcome of a branch is to observe its op-code, because some branch instructions are taken more or less frequently than other branch instructions. Using the branch op-code to predict whether the branch will be taken results in about 75% accuracy. An extension of this technique is to devote a bit of the op-code to the static prediction of branches. This bit is set or cleared by the compiler, depending on whether the compiler estimates that the branch is most likely to be taken. This technique provides branch prediction accuracy in the range 74 to 94%.

Dynamic branch prediction techniques operate at runtime and use the past behavior of the program to predict its future behavior. Suppose the processor maintains a table of branch instructions. This branch table contains information about the likely behavior of each branch. Each time a branch is executed, its outcome (i.e., taken or not taken) is used to update the entry in the table. The processor uses the table to determine whether to take the next instruction from the branch target address (i.e., branch predicted taken) or from the next address in sequence (branch predicted not taken). Single-bit branch predictors provide an accuracy of over 80 percent, and five-bit predictors provide an accuracy of up to 98 percent.

A typical branch prediction algorithm uses the last two outcomes of a branch to predict its future. If the last two outcomes are X, the next branch is assumed to lead to outcome X. If the prediction is wrong, it remains the same the next time the branch is executed (i.e., two failures are needed to modify the prediction). After two consecutive failures, the prediction is inverted and the other outcome assumed. This algorithm responds to trends and is not affected by the occasional single different outcome.
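The two-failures-to-flip algorithm is, in effect, a saturating two-bit counter per branch. Below is a minimal Python sketch of a dynamic predictor built on that rule (the address-indexed dictionary is our simplification of a real branch table).

    class TwoBitPredictor:
        # States 0,1 predict not taken; states 2,3 predict taken. Two
        # consecutive wrong outcomes are needed to flip a firm prediction,
        # so a single odd outcome (e.g. a loop exit) is tolerated.
        def __init__(self):
            self.table = {}                        # branch address -> state

        def predict(self, addr):
            return self.table.get(addr, 2) >= 2    # default: predict taken

        def update(self, addr, taken):
            state = self.table.get(addr, 2)
            self.table[addr] = min(state + 1, 3) if taken else max(state - 1, 0)

    bp = TwoBitPredictor()
    outcomes = [True]*5 + [False] + [True]*5       # a loop with one odd exit
    hits = 0
    for taken in outcomes:
        hits += bp.predict(0x100) == taken
        bp.update(0x100, taken)
    print(hits, "of", len(outcomes))               # 10 of 11: only the odd outcome misses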

Problems
1. What are the characteristics of a CISC processor?

2. The most frequently executed class of instruction is the data move instruction. Why is this?

3. The Berkeley RISC has a 32-bit architecture and yet provides only a 13-bit literal. Why is this, and does it really matter?

4. What are the advantages and disadvantages of register windowing?

5. What is pipelining and how does it increase the performance of a computer?

6. A pipeline is defined by its length (i.e., the number of stages that can operate in parallel). A pipeline can be short or long. What do you think are the relative advantages of long and short pipelines?

7. What is data dependency in a pipelined system and how can its effects be overcome?

8. RISC architectures don't permit operations on operands in memory other than load and store operations. Why?

9. The average number of cycles required by a RISC to execute an instruction is given by Tave = 1 + pb·pt·b, where:

   the probability that a given instruction is a branch is pb;
   the probability that a branch instruction will be taken is pt;
   if a branch is taken, the additional penalty is b cycles;
   if a branch is not taken, there is no penalty.

   Draw a series of graphs of the average number of cycles per instruction as a function of pb·pt for b = 1, 2, 3, and 4.

10. What is branch prediction and how can it be used to reduce the so-called branch penalty in a pipelined system?
