Design and Implementation of an SoC Reconfigurable Computing Architecture for Multimedia Applications
Target applications:
In this proposal, the application domain of interest is real-time image and video processing
for multimedia applications. Real-time image and video processing has long played a key role in
industrial inspection systems and will continue to do so while its domain is being expanded into
multimedia-based consumer electronics products, such as digital and cell-phone cameras, and
intelligent video surveillance systems. In general, real-time image and video processing systems
involve processing vast amounts of image data in a timely manner for the purpose of extracting
useful information, which could mean anything from obtaining an enhanced image to intelligent
scene analysis.
Digital images and video are essentially multidimensional signals and are thus quite data
intensive, requiring a significant amount of computation and memory resources for their
processing. In fact, much of what goes into implementing an efficient image/video processing
system centers on how well the implementation, both hardware and software, exploits different
forms of parallelism in an algorithm, which can be data-level parallelism (DLP) and/or
instruction-level parallelism (ILP). DLP manifests itself in the application of the same operation on different
sets of data, while ILP manifests itself in scheduling the simultaneous execution of multiple
independent operations in a pipeline fashion.
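The distinction between the two forms of parallelism can be sketched in C++; the functions below are illustrative examples only, not part of any particular architecture:

```cpp
#include <array>
#include <cstdint>

// DLP: the same saturating add is applied independently to every pixel,
// so the iterations can run in parallel (e.g. on SIMD lanes).
void brighten(std::array<uint8_t, 8>& pixels, uint8_t offset) {
    for (auto& p : pixels) {
        int v = p + offset;                  // same operation, different data
        p = static_cast<uint8_t>(v > 255 ? 255 : v);
    }
}

// ILP: the three statements below have no data dependences on each other,
// so a VLIW or superscalar machine can issue them in the same cycle.
int sum_min_max(int a, int b, int& mn, int& mx) {
    int s = a + b;                           // independent op 1
    mn = a < b ? a : b;                      // independent op 2
    mx = a > b ? a : b;                      // independent op 3
    return s;
}
```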
Traditionally, image/video processing operations have been classified into three main levels,
namely low, intermediate, and high, where each successive level differs in its input/output data
relationship. Low-level operators take an image as their input and produce an image as their
output, while intermediate-level operators take an image as their input and generate image
attributes as their output, and finally high-level operators take image attributes as their inputs and
interpret the attributes, usually producing some kind of knowledge-based control at their output.
In general, low-level operations are excellent candidates for exploiting DLP. Some
intermediate-level operations are also data intensive with a regular processing structure, thus
making them suitable candidates for exploiting DLP. High-level operations, due to their
irregular structure and low-bandwidth requirements, are suitable candidates for
exploiting ILP, although their data-intensive portions usually include some form of matrix–vector
operations that are suitable for exploiting DLP.
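The three levels can be illustrated with a toy example; the tiny image type and function names below are purely illustrative:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

using Image = std::array<uint8_t, 4>;   // tiny 2x2 "image" for illustration

// Low level: image in, image out (per-pixel negation).
Image negate(const Image& in) {
    Image out{};
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = 255 - in[i];
    return out;
}

// Intermediate level: image in, attribute out (mean brightness).
double mean_brightness(const Image& in) {
    double sum = 0;
    for (uint8_t p : in) sum += p;
    return sum / in.size();
}

// High level: attribute in, knowledge-based decision out.
std::string classify(double mean) {
    return mean > 128.0 ? "bright scene" : "dark scene";
}
```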
3. Some images have pixels with twelve bits (or more) of information, while others have only eight
or four bits of precision. Moreover, many operators (e.g. thresholding) create binary images for
later processing. Systems that can compute with different word sizes therefore have a significant
advantage.
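The payoff of variable word sizes can be seen with thresholding: a hedged sketch of packing an 8-bit image into a 1-bit-per-pixel binary image, so later binary operations touch eight times less data (the function name is illustrative):

```cpp
#include <cstdint>
#include <vector>

// Threshold an 8-bit image into a packed binary image: one bit per pixel,
// eight pixels per byte.
std::vector<uint8_t> threshold_pack(const std::vector<uint8_t>& img, uint8_t t) {
    std::vector<uint8_t> packed((img.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < img.size(); ++i)
        if (img[i] >= t)                     // pixel at or above threshold -> 1
            packed[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    return packed;
}
```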
Video image processing is typified by high data rates (187.5 Mbytes/sec for real time HDTV),
making an efficient method of data transfer between host and platform important. The memory
system must also be able to cope with the high data rates. The structure of many video image
processing tasks can often be decomposed into pipelined sub-operations. To give good
performance, the architecture that will be suggested should be able to exploit this kind of
pipelined, stream-based processing.
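The decomposition into pipelined sub-operations can be sketched in software as a chain of per-pixel stages through which frames stream; in hardware the stages would overlap in time rather than run sequentially (the types below are illustrative):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// A video task decomposed into pipelined sub-operations: each stage is a
// per-pixel function, and frames stream through the stages in order.
using Stage = std::function<uint8_t(uint8_t)>;

std::vector<uint8_t> run_pipeline(std::vector<uint8_t> frame,
                                  const std::vector<Stage>& stages) {
    for (const auto& stage : stages)
        for (auto& p : frame)
            p = stage(p);        // in hardware, stages process different frames concurrently
    return frame;
}
```

For example, a gain stage followed by a threshold stage forms a two-stage pipeline into which frames can be fed back to back.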
From the literature, one can see there are three major architectural features that are essential to
any image/video processing system, namely single instruction multiple data (SIMD), very long
instruction word (VLIW), and an efficient memory subsystem. While SIMD can be used for
exploiting DLP, VLIW can be used for exploiting instruction-level parallelism (ILP), and thus for
speeding up high-level operations. VLIW furnishes the ability to execute multiple instructions
within one processor clock cycle, all running in parallel. Of course, while SIMD and VLIW can
help speed up the processing of diverse image/video operations, the time saved through such
mechanisms would be completely wasted if there did not exist an efficient way to transfer data
throughout the system. Thus, an efficient memory subsystem is considered a crucial component of
a real-time image/video processing system, especially for low-level and intermediate-level
operations that require massive amounts of data transfer bandwidth as well as high-performance
computation power. Concepts such as direct memory access (DMA) and internal versus external
memory are important. DMA allows transferring of data within a system without burdening the
CPU with data transfers. DMA is a well-known tool for hiding memory access latencies, especially
for image data. Efficient use of any available on-chip memory is also critical since such memory
can be accessed at a faster rate than external memory.
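The latency-hiding role of DMA can be illustrated with a double-buffering sketch: while the CPU processes one tile held in fast on-chip memory, the DMA engine fills the other tile. The names below are illustrative and `dma_copy` is a software stand-in for a hardware transfer, not a real DMA API:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t TILE = 4;          // illustrative tile size

void dma_copy(uint8_t* dst, const uint8_t* src, std::size_t n) {
    std::memcpy(dst, src, n);            // stand-in for a hardware DMA transfer
}

void process_frame(const std::vector<uint8_t>& frame, std::vector<uint8_t>& out) {
    uint8_t buf[2][TILE];                // two on-chip tile buffers
    std::size_t tiles = frame.size() / TILE;
    dma_copy(buf[0], frame.data(), TILE);               // prefetch first tile
    for (std::size_t t = 0; t < tiles; ++t) {
        if (t + 1 < tiles)                              // DMA next tile...
            dma_copy(buf[(t + 1) % 2], frame.data() + (t + 1) * TILE, TILE);
        for (std::size_t i = 0; i < TILE; ++i)          // ...while CPU works
            out[t * TILE + i] = 255 - buf[t % 2][i];    // example: negate pixels
    }
}
```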
MorphoSys (Morphing System) [2], a 2D mesh-based architecture, has a MIPS-like “TinyRISC”
processor with an extended instruction set, a mesh-connected 8 by 8 reconfigurable array (RA), a
frame buffer for intermediate data, context memory, and DMA controller (see figure below). The
RA is divided into four quadrants of 4 by 4 16-bit reconfigurable cells (RCs) each, featuring an
ALU, multiplier, shifter, register file, and a 32-bit context register for storing the configuration
word. The interconnect network features three layers: four nearest-neighbor (NN) ports, links of
distance 2, and interquadrant buses spanning the whole array.
Extra TinyRISC DMA instructions initiate data transfers between the main memory and the
“frame buffer” internal data memory, which holds blocks of intermediate results, 128 by 16 bytes in total.
The CHESS Array. The CHESS hexagonal array [3] features a chessboard-like floorplan with
interleaved rows of alternating ALU / switchbox sequence (figure below). Embedded RAM areas
support high memory requirements. Switchboxes can be converted to 16 word by 4 bit RAMs if
needed. RAMs within switchboxes can also be used as a 4-input, 4-output LUT. The interconnect
fabric of CHESS consists of segmented 4-bit buses of different lengths. There are 16 buses in each
row and column: 4 buses for local connections spanning one switchbox, 4 buses of length 2, and 2
buses each of lengths 4, 8, and 16.
To avoid routing congestion, the array also features embedded 256-word by 8-bit block RAMs.
An ALU data output may feed the configuration input of another ALU, so that its functionality can
be changed on a cycle-by-cycle basis at runtime without uploading. However, partial
configuration by uploading is not possible.
SONIC Architecture [4] is designed to support the software plug-in methodology to accelerate
video image processing applications. SONIC differs from other architectures through the use of
Plug-In Processing Elements (PIPEs) and the Application Programmer’s Interface (API). Each
PIPE contains a reconfigurable processor, a scalable router that also formats video data, and a
frame-buffer memory. The SONIC architecture integrates multiple PIPEs together using a
specialized bus structure which enables flexible and optimal pipelined processing.
SONIC-1 communicates with the host PC through the PCI bus and has 8 PIPEs. The overall
SONIC architecture consists of a number of Plug-In Processing Elements (PIPEs), connected by
the PIPE bus, and PIPE Flow buses. Figure above gives an overview of the SONIC architecture.
SONIC’s bus architecture consists of a shared global bus combined with a flexible pipeline bus.
This allows the SONIC architecture to implement a number of different computational schemes.
Proposal in details:
There are two general fields of interest for this project: increasing the performance of
FPGA-based designs and improving their reconfigurability. In general, the ideal architecture would have the
following characteristics: high performance, flexibility, easy upgradability, low development cost,
and a migration path to lower cost as the application matures and volume ramps.
I will study existing architectures for reconfigurable multimedia and image
processing systems and platform-based design applied to single-FPGA systems, and I will
investigate the set of criteria that characterize the design of reconfigurable systems, such as
granularity, depth of programmability, reconfigurability, interface, and computation models. The
multimedia tasks (such as video compression/decompression and image enhancement processing) will
be profiled to identify frequent and time-critical operations. I will attempt to improve upon some
of the shortcomings associated with existing video/image reconfigurable systems.
The initial idea for the suggested architecture is a system that exploits parallelism at
multiple levels of granularity within these applications to obtain the maximum computation
throughput. The reconfigurable architectures that will be suggested exploit both fine-grain
parallelism (more suitable for low-level and medium-level operations in the image processing
chain) and coarse-grain parallelism (more suitable for high-level operations in the image
processing chain).
The system will include an array of reconfigurable processing elements (PEs) to provide high
computational density and sufficient register and memory resources for implementing multimedia
and image processing algorithms. I suggest an architecture that supports SIMD parallelism (as often
found in video encoding/decoding) to accelerate low-level (DLP) operations, and at the same time
supports MIMD programs (by providing each PE with a local instruction RAM) to accelerate
high-level (ILP) operations. All these models are connected by a flexible communication
network.
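The intended SIMD/MIMD duality can be sketched as a small software model. This is an illustrative sketch of the idea, not the proposed hardware: the opcode set, PE count, and member names are all assumptions made for the example:

```cpp
#include <array>
#include <cstdint>

// Toy opcode set; a real PE would have a much richer instruction set.
enum class Op : uint8_t { ADD, SUB, MAX };

struct PE {
    int16_t acc = 0;                     // accumulator register
    std::array<Op, 4> iram{};            // local instruction RAM (MIMD mode)
    void exec(Op op, int16_t operand) {
        switch (op) {
            case Op::ADD: acc = static_cast<int16_t>(acc + operand); break;
            case Op::SUB: acc = static_cast<int16_t>(acc - operand); break;
            case Op::MAX: acc = acc > operand ? acc : operand; break;
        }
    }
};

struct Array4 {
    std::array<PE, 4> pes;
    // SIMD: one broadcast opcode, per-PE data.
    void simd_step(Op op, const std::array<int16_t, 4>& data) {
        for (std::size_t i = 0; i < pes.size(); ++i) pes[i].exec(op, data[i]);
    }
    // MIMD: each PE executes instruction `pc` from its own instruction RAM.
    void mimd_step(std::size_t pc, const std::array<int16_t, 4>& data) {
        for (std::size_t i = 0; i < pes.size(); ++i)
            pes[i].exec(pes[i].iram[pc], data[i]);
    }
};
```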
The reconfigurable array directly accesses a dedicated “frame buffer” memory which acts as a
local data cache. Each PE has an ALU, a multiplier, a register file, a configuration register, and
I/O ports. The system will also include a customized RISC processor that handles general-purpose
operations, controls the operation of the PE array using special instructions designed for this
purpose, and initiates all data transfers and configuration programs (combining the advantages of
both microprocessor and FPGA resources in a single system), so that an entire standalone system
can be achieved. Commercial customizable processors, which are customizable by the user at design
time, include the Altera NIOS and the Xilinx MicroBlaze. Customization involves selecting options
for various processor features, such as cache size and type, data-path bit-widths, floating-point
processing, and so on.
Along with the general-purpose instructions for arithmetic, logical, and shift functions, other
instructions helpful to image processing applications will also be included in the ALU. These
include maximum, minimum, average, and combined shift/add operations.
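The candidate image-oriented ALU operations above can be modeled in C++ on 8-bit pixels as follows; the rounding behavior of `px_avg` and the exact form of the combined shift/add are design assumptions for the sketch:

```cpp
#include <cstdint>

uint8_t px_max(uint8_t a, uint8_t b) { return a > b ? a : b; }
uint8_t px_min(uint8_t a, uint8_t b) { return a < b ? a : b; }

// Average with round-to-nearest: add then shift, (a + b + 1) / 2,
// which maps directly onto an adder followed by a 1-bit shifter.
uint8_t px_avg(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>((a + b + 1) >> 1);
}

// Combined shift/add: computes a * 2^k + b in one operation, useful for
// fixed-point scaling inside filter kernels.
int32_t shift_add(int32_t a, unsigned k, int32_t b) { return (a << k) + b; }
```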
Another possible architecture which may be proposed is a coarse-grained
reconfigurable array of PEs connected in an MIMD organization as an FPGA-based accelerator
board. This board is then used as a coprocessor to enhance a standard computer workstation by
attaching it to the computer's expansion bus. The reconfigurable array of PEs will accelerate only
the most critical computation kernels of the program. Data transfer between the frame buffer on
the accelerator board and the main memory of the workstation is handled using direct memory
access (DMA) to allow fast data transfer within the system; DMA is a well-known tool for hiding
memory access latencies, especially for image data.
These ideas will be subjected to extensive investigation during my study to find the most
suitable one. Finally, the performance and efficiency of the proposed system will be measured and
evaluated against other existing reconfigurable multimedia systems.
Design tools:
With the continual growth in size and functionality of FPGAs (Field Programmable Gate
Arrays) there has been increasing interest in their use as implementation platforms for image
processing applications, particularly real time video processing. Due to their structure with a large
array of parallel logic and registers, FPGAs can exploit the data parallelism found in images. They
can either perform all required operations themselves or perform a subset of operations to reduce
the data before passing processing on to a standard DSP or microprocessor. Due to their programmable nature,
FPGAs can be programmed to exploit different types of parallelism inherent in an image/video
processing algorithm. This in turn leads to highly efficient real-time image/video processing for
low-level, intermediate-level, or high-level operations, enabling an entire imaging system to be
implemented on a single FPGA. In many cases, FPGAs have the potential to meet or exceed the
performance of a single DSP or multiple DSPs.
For example, the Xilinx Virtex-4 FPGAs, built with 90-nm CMOS technology, have up to 512
multiply-accumulate (MAC) units, each with a dedicated 18-bit multiplier followed by a 48-bit
accumulator, operating at greater than 500 MHz. Such an FPGA can achieve a peak performance
of 256 billion MAC operations per second or 512 GOPS, using only the dedicated cores. If an
implementation also uses a reconfigurable fabric, peak performance can reach as high as 1 TOPS
(16-bit operations)—a 100 to 1,000 times higher throughput than that of any existing commercial
microprocessor or DSP.
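The quoted peak figures are internally consistent if each MAC is counted as two operations (a multiply plus an add); a back-of-envelope check:

```cpp
#include <cstdint>

// 512 MAC units running at 500 MHz give 256e9 MACs/s; counting each MAC
// as two operations yields 512e9 ops/s, i.e. 512 GOPS.
constexpr uint64_t mac_units  = 512;
constexpr uint64_t clock_hz   = 500000000ULL;          // 500 MHz
constexpr uint64_t macs_per_s = mac_units * clock_hz;  // 256 billion MACs/s
constexpr uint64_t ops_per_s  = macs_per_s * 2;        // 512 GOPS
```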
Therefore, and based on these considerations, FPGAs are particularly well suited to meet the
requirements of many video and image processing applications. I will use one of the commercially
available FPGA-based development boards. I will look at the requirements and availability of FPGA
development platforms for image processing applications and what hardware resources are needed
to support elements that are common to the most popular image analysis techniques, such as
high-speed I/O and frame stores. I will use an extension of C++-based languages, such as SystemC,
to support the requirements of reconfigurable SoCs and to allow fast design implementation.
References:
[1] T. Miyamori and K. Olukotun, "REMARC: Reconfigurable Multimedia Array Coprocessor," Proc.
ACM/SIGDA FPGA '98, Monterey, Feb. 1998.
[3] A. Marshall et al., "A Reconfigurable Arithmetic Array for Multimedia Applications," Proc.
ACM/SIGDA FPGA '99, Monterey, Feb. 21-23, 1999.