VLIW Philips
ABSTRACT
VLIW architectures are distinct from the traditional RISC and CISC architectures implemented in current mass-market microprocessors. It is important to distinguish instruction-set architecture (the processor programming model) from implementation (the physical chip and its characteristics). VLIW microprocessors and superscalar implementations of traditional instruction sets share some characteristics: multiple execution units and the ability to execute multiple operations simultaneously. The techniques used to achieve high performance, however, are very different, because the parallelism is explicit in VLIW instructions but must be discovered by hardware at run time in superscalar processors. VLIW implementations are simpler for very high performance. Just as RISC architectures permit simpler, cheaper high-performance implementations than do CISCs, VLIW architectures are simpler and cheaper than RISCs because of further hardware simplifications. VLIW architectures, however, require more compiler support.
Philips Semiconductors
INTRODUCTION AND MOTIVATION
Currently, in the mid-1990s, IC fabrication technology is advanced enough to allow unprecedented implementations of computer architectures on a single chip. Also, the current rate of process advancement allows implementations to be improved at a rate that is satisfying for most of the markets these implementations serve. In particular, the vendors of general-purpose microprocessors are competing for sockets in desktop personal computers (including workstations) by pushing the envelopes of clock rate (raw operating speed) and parallel execution.

The market for desktop microprocessors is proving to be extremely dynamic. In particular, the x86 market has surprised many observers by attaining performance levels and price/performance levels that many thought were out of reach. The reason for the pessimism about the x86 was its architecture (instruction set). Indeed, with the advent of RISC architectures, the x86 is now recognized as a deficient instruction set.

Instruction-set compatibility is at the heart of the desktop microprocessor market. Because the application programs that end users purchase are delivered in binary form (directly executable by the microprocessor), the end users' desire to protect their software investments creates tremendous instruction-set inertia.

There is a different market, though, that is much less affected by instruction-set inertia. This market is typically called the embedded market, and it is characterized by products containing factory-installed software that runs on a microprocessor whose instruction set is not readily evident to the end user. Although the vendor of the product containing the embedded microprocessor has an investment in the embedded software, just like end users with their applications, there is considerably more freedom to migrate embedded software to a new microprocessor with a different instruction set.
To overcome this lower level of instruction-set inertia, all it takes is a sufficiently better set of implementation characteristics, particularly absolute performance and/or price-performance. This lower level of instruction-set inertia gives the vendors of embedded microprocessors the freedom and initiative to seek out new instruction sets. The relative success of RISC microprocessors in the high end of the embedded market is an example of innovation by microprocessor vendors that produced a benefit large enough to overcome the market's inertia. To the vendors' disappointment, the benefits of RISCs have not been sufficient to overcome the instruction-set inertia of the mainstream desktop computer market.

Because of advances in IC fabrication technology and in high-level language compiler technology, it now appears that microprocessor vendors are compelled by the potential benefits of another change in microprocessor instruction sets. As before, the embedded market is likely to be the first to accept this change. The new direction in microprocessor architecture is toward VLIW (very long instruction word) instruction sets. VLIW architectures are characterized by instructions that each specify several independent operations. By comparison, RISC instructions typically specify one operation, and CISC instructions typically specify several dependent operations. VLIW instructions are necessarily longer than RISC or CISC instructions, thus the name.
WHY VLIW?
The key to higher performance in microprocessors for a broad range of applications is the ability to exploit fine-grain, instruction-level parallelism. Some methods for exploiting fine-grain parallelism include:
+ pipelining
+ multiple processors
+ superscalar implementation
+ specifying multiple independent operations per instruction
Pipelining is now universally implemented in high-performance processors. Little more can be gained by improving the implementation of a single pipeline. Using multiple processors improves performance for only a restricted set of applications. Superscalar implementations can improve performance for all types of applications. Superscalar (super: beyond; scalar: one-dimensional) means the ability to fetch, issue to execution units, and complete more than one instruction at a time. Superscalar implementations are required when architectural compatibility must be preserved, and they will be used for entrenched architectures with legacy software, such as the x86 architecture that dominates the desktop computer market.

Specifying multiple operations per instruction creates a very long instruction word architecture, or VLIW. A VLIW implementation has capabilities very similar to those of a superscalar processor (issuing and completing more than one operation at a time), with one important exception: the VLIW hardware is not responsible for discovering opportunities to execute multiple operations concurrently. For the VLIW implementation, the long instruction word already encodes the concurrent operations. This explicit encoding leads to dramatically reduced hardware complexity compared to a high-degree superscalar implementation of a RISC or CISC. The big advantage of VLIW, then, is that a highly concurrent (parallel) implementation is much simpler and cheaper to build than equivalently concurrent RISC or CISC chips. VLIW is a simpler way to build a superscalar microprocessor.
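To make the idea of explicit encoding concrete, the following Python sketch models a hypothetical three-slot VLIW instruction. The slot count, the operation names, and the register names are illustrative assumptions and are not taken from any particular chip. Every slot reads its operands before any slot writes a result, so the operations in one instruction behave as if they ran at the same time. The model does no dependence checking, on the assumption that the compiler has placed only independent operations in one instruction.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical operation set; a real VLIW architecture defines its own.
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


@dataclass(frozen=True)
class Operation:
    opcode: str
    dest: str
    src1: str
    src2: str


# One VLIW instruction: a list of slots; None marks an empty slot (a no-op).
Instruction = List[Optional[Operation]]


def execute(instr: Instruction, regs: Dict[str, int]) -> None:
    """Execute all operations of one instruction as a single parallel step."""
    # Phase 1: every slot reads its operands from the old register state.
    results = []
    for op in instr:
        if op is not None:
            value = OPS[op.opcode](regs[op.src1], regs[op.src2])
            results.append((op.dest, value))
    # Phase 2: all results are written back together.
    for dest, value in results:
        regs[dest] = value


regs = {"r1": 2, "r2": 3, "r3": 10, "r4": 4, "r5": 0, "r6": 0, "r7": 0}
instr = [
    Operation("add", "r5", "r1", "r2"),
    Operation("sub", "r6", "r3", "r4"),
    Operation("mul", "r7", "r1", "r4"),
]
execute(instr, regs)
print(regs["r5"], regs["r6"], regs["r7"])  # 5 6 8
```

The two-phase structure of `execute` is the design choice that matters here: because all reads precede all writes, the order of the slots within an instruction does not affect the result.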
ARCHITECTURE VS. IMPLEMENTATION
The word "architecture" in the context of computer science is often misused. Used accurately, architecture refers to the instruction set and resources available to someone who writes programs. The architecture is what is described in a definition document, often called a user's manual. Thus, architecture comprises instruction formats, instruction semantics (operation definitions), registers, memory addressing modes, characteristics of the address space (linear, segmented, special address regions), and anything else a programmer would need to know.

An implementation is the hardware design that realizes the operations specified by the architecture. The implementation determines the characteristics of a microprocessor that are most often measured: price, performance, power consumption, heat dissipation, number of pins, operating frequency, and so on.

Architecture and implementation are separate, but they do interact. As many researchers into computer architecture discovered between the mid-1970s and the 1980s, architecture can have a dramatic effect on the quality of an implementation. In the mid-1980s, IC process technology could fabricate a microcoded implementation of a CISC instruction set and a tiny cache or MMU. For about the same cost, this same process technology could fabricate a pipelined implementation of a simple RISC instruction set (including
TABLE 1

INSTRUCTION SIZE
  CISC: Varies
  RISC: One size, usually 32 bits
  VLIW: One size

INSTRUCTION FORMAT
  CISC: Field placement varies
  RISC: Regular, consistent placement of fields
  VLIW: [missing]

INSTRUCTION SEMANTICS
  CISC: Varies from simple to complex; possibly many dependent operations per instruction
  RISC: Almost always one simple operation
  VLIW: Several independent operations per instruction

REGISTERS
  CISC: Few, sometimes special
  RISC: Many, general-purpose
  VLIW: Many, general-purpose

MEMORY REFERENCES
  CISC: Bundled with operations in many different types of instructions
  RISC: Not bundled with operations, i.e., load/store architecture
  VLIW: Not bundled with operations, i.e., load/store architecture

HARDWARE DESIGN FOCUS
  CISC: Exploit microcoded implementations
  RISC: Exploit implementations with one pipeline and no microcode
  VLIW: Exploit implementations with multiple pipelines, no microcode, and no complex dispatch logic
The differences between RISC, CISC, and VLIW are in the formats and semantics of the instructions. Table 1 compares architecture characteristics.

CISC instructions vary in size, often specify a sequence of operations, and can require serial (slow) decoding algorithms. CISCs tend to have few registers, and the registers may be special-purpose, which restricts the ways in which they can be used. Memory references are typically combined with other operations (such as "add memory to register"). CISC instruction sets are designed to take advantage of microcode.

RISC instructions specify simple operations, are fixed in size, and are easy (quick) to decode. RISC architectures have a relatively large number of general-purpose registers. Instructions can reference main
FIGURE 1
FIGURE 2
IMPLEMENTATION COMPARISON: SUPERSCALAR CISC, SUPERSCALAR RISC, VLIW
The differences between CISC, RISC, and VLIW architectures manifest themselves in their respective implementations. Comparing high-performance implementations of each is the most telling. High-performance RISC and CISC designs are called superscalar implementations. Superscalar in this context simply means "beyond scalar," where scalar means one operation at a time. Thus, superscalar means more than one operation at a time.

Most CISC instruction sets were designed with the idea that an implementation will fetch one instruction, execute its operations fully, then move on to the next instruction. The assumed execution model was thus serial in nature. RISC architects were aware of the advantages and peculiarities of pipelined processor implementations, and so designed RISC instruction sets with a pipelined execution model in mind. In contrast to the assumed CISC execution model, the idea for the RISC execution model is that an implementation will fetch one instruction, issue it into the pipeline, and then move on to the next instruction before the previous one has completed its trip through the pipeline.
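The difference between the two assumed execution models can be put in rough numbers. The Python sketch below is a deliberate simplification: it assumes that every instruction takes the same number of steps, that the pipeline has one stage per step, and that no stalls occur. The function names and the five-stage figure are illustrative assumptions.

```python
def serial_cycles(num_instructions: int, steps_per_instruction: int) -> int:
    """Serial model: each instruction finishes all its steps before the next starts."""
    return num_instructions * steps_per_instruction


def pipelined_cycles(num_instructions: int, stages: int) -> int:
    """Ideal pipeline: the first instruction needs `stages` cycles to complete,
    and each later instruction completes one cycle after the previous one."""
    if num_instructions == 0:
        return 0
    return stages + (num_instructions - 1)


n, k = 100, 5
print(serial_cycles(n, k))     # 500
print(pipelined_cycles(n, k))  # 104
```

Under these idealized assumptions, the pipelined model approaches one completed instruction per cycle as the instruction count grows, which is the baseline that superscalar and VLIW designs then try to exceed.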
SOFTWARE INSTEAD OF HARDWARE: IMPLEMENTATION ADVANTAGES OF VLIW
A VLIW implementation achieves the same effect as a superscalar RISC or CISC implementation, but the VLIW design does so without the two most complex parts of a high-performance superscalar design. Because VLIW instructions explicitly specify several independent operations (that is, they explicitly specify parallelism), it is not necessary to have decoding and dispatching hardware that tries to reconstruct parallelism from a serial instruction stream. Instead of having hardware attempt to discover parallelism, VLIW processors rely on the compiler that generates the VLIW code to explicitly specify parallelism.

Relying on the compiler has advantages. First, the compiler has the ability to look at much larger windows of instructions than the hardware. For a superscalar processor, a larger hardware window implies a larger amount of logic and therefore chip area. At some point, there simply is not enough of either, and window size is constrained. Worse, even before a simple limit on the amount of hardware is reached, complexity may adversely affect the speed of the logic, so the window size is constrained to avoid reducing the clock speed of the chip. Software windows can be arbitrarily large. Thus, looking for parallelism in a software window is likely to yield better results.

Second, the compiler has knowledge of the source code of the program. Source code typically contains important information about program behavior that can be used to help express maximum parallelism at the instruction-set level. A powerful technique called trace-driven compilation can be employed to dramatically improve the quality of code output by the compiler. Trace-driven compilation first produces a suboptimal, but correct, VLIW program. The program has embedded routines that take note of program behavior. The recorded program behavior (which branches are taken, how often, etc.) is then used by the compiler during a second compilation to produce code that takes advantage of accurate knowledge of
THE ADVANTAGE OF COMPILER COMPLEXITY OVER HARDWARE COMPLEXITY
While a VLIW architecture reduces hardware complexity compared with a superscalar implementation, a much more complex compiler is required. Extracting maximum performance from a superscalar RISC or CISC implementation does require sophisticated compiler techniques, but the level of sophistication in a VLIW compiler is significantly higher.

VLIW simply moves complexity from hardware into software. Luckily, this trade-off has a significant side benefit: the complexity is paid for only once, when the compiler is written, instead of every time a chip is fabricated. Among the possible benefits is a smaller chip, which leads to increased profits for the microprocessor vendor and/or lower prices for the customers that use the microprocessors. Complexity is usually easier to deal with in a software design than in a hardware design. Thus, the chip may cost less to design, be quicker to design, and require less debugging, all of which are factors that can make the design cheaper. Also, improvements to the compiler can be made after chips have been fabricated; improvements to superscalar dispatch hardware require changes to the microprocessor, which naturally incur all the expenses of turning a chip design.

FIGURE 4

PRACTICAL VLIW ARCHITECTURES AND IMPLEMENTATIONS
The simplest VLIW instruction format encodes an operation for every execution unit in the machine. This makes sense under the assumption that every instruction will always have something useful for every execution unit to do. Unfortunately, despite the best efforts of the best compiler algorithms, it is typically not possible to pack every instruction with work for all execution units. Also, in a VLIW machine that has both integer and floating-point execution units, even the best compiler would not be able to keep the floating-point units busy during the execution of an integer-only application.
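The packing problem can be illustrated with a small Python sketch of a greedy scheduler. This is a toy model under stated assumptions, not a description of any production VLIW compiler: the machine format (two integer slots and one floating-point slot), the one-instruction latency, and the rule that each register is written at most once are all hypothetical simplifications, so only read-after-write dependences are considered.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Op:
    name: str
    unit: str               # "int" or "fp": the kind of execution unit required
    dest: str
    srcs: Tuple[str, ...]


# Hypothetical machine: every instruction has two integer slots and one FP slot.
SLOTS = ("int", "int", "fp")


def schedule(ops: List[Op]) -> List[List[Optional[Op]]]:
    """Greedily pack ops (given in program order) into fixed-format instructions.

    Assumes each register is written at most once and every operation has a
    latency of one instruction, so only read-after-write dependences matter.
    """
    bundles: List[List[Optional[Op]]] = []
    ready_at: Dict[str, int] = {}  # register -> first instruction that may read it
    for op in ops:
        i = max((ready_at.get(s, 0) for s in op.srcs), default=0)
        while True:
            while len(bundles) <= i:
                bundles.append([None] * len(SLOTS))
            free = [j for j, kind in enumerate(SLOTS)
                    if kind == op.unit and bundles[i][j] is None]
            if free:
                bundles[i][free[0]] = op
                ready_at[op.dest] = i + 1
                break
            i += 1  # no suitable slot free here; try the next instruction
    return bundles


def utilization(bundles: List[List[Optional[Op]]]) -> float:
    """Fraction of slots that hold a real operation rather than a no-op."""
    if not bundles:
        return 0.0
    filled = sum(op is not None for b in bundles for op in b)
    return filled / (len(bundles) * len(SLOTS))


# An integer-only program: the FP slot can never be filled.
program = [
    Op("a", "int", "r1", ("x", "y")),
    Op("b", "int", "r2", ("x", "z")),
    Op("c", "int", "r3", ("r1", "r2")),
    Op("d", "int", "r4", ("r3", "y")),
]
bundles = schedule(program)
for b in bundles:
    print([op.name if op else "nop" for op in b])
print(f"{utilization(bundles):.2f}")  # 0.44
```

In this example only four of the nine slots carry useful work: the dependence chain from `c` to `d` leaves integer slots empty, and the floating-point slot is idle throughout, which mirrors the integer-only case described above.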
HISTORICAL PERSPECTIVE
VLIW is not a new computer architecture. Horizontal microcode, a processor implementation technique in use for decades, defines a specialized, low-level VLIW architecture. This low-level architecture runs a microprogram that interprets (emulates) a higher-level (user-visible) instruction set. The VLIW nature of the horizontal microinstructions is used to attain a high-performance interpretation of the high-level instruction set by executing several low-level steps concurrently. Each horizontal microcode instruction encodes many irregular, specialized operations that are directed at primitive logic blocks inside a processor. From the outside, the horizontally microcoded processor appears to be directly running the emulated instruction set.

In the 1980s, a few small companies attempted to commercialize VLIW architectures in the general-purpose market. Unfortunately, they were ultimately unsuccessful. Multiflow is the best known. Multiflow's founders were academicians who did pioneering, fundamental research into VLIW compilation techniques. Multiflow's computers worked, but the company was probably about a decade ahead of its time. The Multiflow machines, built from discrete parts, could not keep pace with the rapid advances in single-chip microprocessors. Using today's technology, they would have a better chance at being competitive.

In 1989, Intel introduced the i860 RISC microprocessor. This simple chip had two modes of operation: a scalar mode and a VLIW mode. In the VLIW mode, the processor always fetched two instructions and assumed that one was an integer instruction and the other floating-point. A single program could switch (somewhat painfully) between the scalar and VLIW modes, thus implementing a crude form of code compression. Ultimately, the i860 failed in the market. The chip was positioned to compete with other general-purpose microprocessors for desktop computers, but it had compilers of insufficient quality to satisfy the needs of this market.
Pub#: 9397-750-01759