Design and Implementation of Integer Transform and Quantization Processor For H.264 Encoder On FPGA

Conference Paper · January 2010

DOI: 10.1109/ACT.2009.164 · Source: IEEE Xplore
All content following this page was uploaded by N. Keshaveni on 21 July 2015.

International Journal of Computer Science & Communication Vol. 1, No. 1, January-June 2010, pp. 43-50
Design and FPGA Implementation of Integer Transform and Quantization Processor and Their Inverses for H.264 Video Encoder
N. Keshaveni (1), S. Ramachandran (2) & K.S. Gurumurthy (3)
(1) Department of Electronics & Communication Eng., Dr MGR University, Chennai, India
(2) National Academy of Excellence, Bangalore, India
(3) Department of Electronics & Communication Eng., UVCE, Bangalore, India
[email protected], [email protected], [email protected]
ABSTRACT
This paper proposes a novel implementation of the core processors, intra prediction, the integer transform, quantization,
inverse quantization and inverse transformation for H.264 Video Encoder using an FPGA. It is capable of processing
video frames with the desired compression controlled by the user input. The algorithm and architecture of the core
modules of the video encoder namely, horizontal mode of intra prediction, the integer transform, quantization, inverse
quantization and inverse transformation were developed, designed and coded in Verilog. The complete H.264
Advanced Video Codec was coded in Matlab in order to verify the results of the Verilog implementation. The processor
is implemented on a Xilinx Virtex-II Pro XUPVP30 FPGA. The gate count of the implementation is approximately
870,000. It can process 1024x768 pixels moving color pictures in 4:2:0 format at 25 frames per second. The reconstructed
picture quality is better than 35 dB.
Keywords: Intraprediction, Integer transform, quantization, Inverse quantization, Inverse transform, Codec, Verilog,
FPGA.
1. INTRODUCTION

With the widespread use of technologies such as digital television, internet streaming video and DVD video, video compression has become an essential component of broadcast and entertainment media. Currently, the video codec that achieves the highest data compression without sacrificing picture quality is MPEG-4 Part 10 Advanced Video Coding, also known as H.264 [1-3]. This codec has many new features such as intra-frame prediction, 4x4 integer transform, quantization, context adaptive entropy coding and deblocking filter, which were not available in the earlier standards. The present work has realized some of these features. The implementation conforms to the baseline, main as well as extended profiles since only Intra (I) frames are used. While hardware implementations of the integer transform [4, 5] have been reported, no implementation of intraprediction has been found to the best of the authors’ knowledge. Qiang Peng et al. [6] have reported an implementation of the H.264 encoder using a 32-bit RISC CPU on a single chip running the Linux operating system, which can process PAL, SECAM or NTSC video at 80 MHz.

This paper is organized as follows: Sections 2 and 3 describe the basic building blocks of an AVC encoder and the principles involved, and highlight the actual modules implemented in the present work. A novel parallel algorithm is presented in Section 4 for evaluating the transform and quantization, suitable for high speed implementation on FPGA/ASIC. Section 5 presents detailed architectures of the Intra Prediction, Integer Transform and Quantization Processors and their Inverses. Simulation results and discussions are presented in Section 6. The FPGA implementation results of the design are presented in Section 7 and conclusions are presented in Section 8.

2. BLOCK DIAGRAM OF H.264 ADVANCED VIDEO ENCODER AS IMPLEMENTED

This section presents the overall building blocks of the Advanced Video Encoder as well as the functional modules implemented in the present work. The block diagram of the H.264 Encoder [7, 8] is shown in Fig. 1. Therein, the modules designed in this work are shown in grey shades. An input frame or field Fn is processed in units of a macro block. A macro block consists of 16x16 pixels. Each macro block is encoded in intra or inter mode and, for each block in the macro block, a prediction P is formed based on the reconstructed picture samples [9]. In Intra mode, P is formed from samples in the current slice that have been previously reconstructed. In Inter mode, P is formed by motion-compensated prediction from one or two reference picture(s) selected from the set of reference pictures.
Fig 1: Modules Implemented in H.264 Video Encoder

In Fig. 1, the reference picture is shown as the previously encoded picture F'n-1, but the prediction reference for each macro block partition (in inter mode) may be chosen from a selection of past or future pictures (in display order) that have already been encoded, reconstructed and filtered. The prediction P is subtracted from the current block to produce a residual (difference) block that is transformed using a block transform and quantized to give a set of quantized transform coefficients, which are reordered and entropy encoded [10]. The entropy-encoded coefficients, together with other information required to decode each block within the macro block, form the compressed bit stream, which is passed on to a Network Abstraction Layer (NAL) for transmission or storage. The extra information required to decode each block typically comprises prediction modes, quantization parameter, motion vector information, etc. Apart from encoding and transmitting each block in a macro block, the encoder also decodes, i.e., reconstructs it to provide a reference for further predictions. The quantized coefficients are scaled, inverse quantized and inverse transformed to produce a difference block. The prediction block P is added to the difference block to get the reconstructed block. A filter is applied to reduce the effects of blocking distortion and the reconstructed reference picture is generated as F'n.

In the present work, the core processors, namely the intra prediction, integer transform, quantization, inverse quantization and inverse transform (TQIQIT), were implemented. They are shown shaded in the figure. The design of the remaining modules is involved and their development is in progress.

3. ARCHITECTURAL DETAILS OF HORIZONTAL MODE OF INTRAPREDICTION

Within a picture frame, pixels close to each other tend to have similar values. Intraprediction is done in order to exploit the spatial redundancy within a frame. Each pixel is predicted based on the values of its neighboring pixels that are available. Instead of processing the pixel value, only the difference between the actual value of the pixel and its predicted value, known as the residual pixel, is processed. If a block or macro block is encoded in intra mode, a prediction block is formed based on previously encoded and reconstructed blocks. This prediction block is subtracted from the current block prior to encoding. These details are explained in the following paragraphs. The H.264 standard recommends the use of nine different prediction modes, namely, vertical, horizontal, DC, diagonal down left, diagonal down right, vertical right, vertical left, horizontal down and horizontal up, from which one or more may be used to form the predicted block [11-14]. There are a total of 9 optional prediction modes for each 4x4 luma block; 4 optional modes for a 16x16 luma block; and one mode that is always applied to each 4x4 chroma block. For the luminance (luma) samples, prediction may be formed for each 4x4 sub block or for a 16x16 macro block. In the present work, the horizontal mode of intra prediction and the 4x4 luma block are used.

Fig 2: Basic Principle of Horizontal Mode of Intra-prediction
a. Order of Sub block processing in a Macro block;
b. Horizontal Mode Prediction block (shaded part) for processing the current sub block;
c. TQIQIT Processing (Reconstruction of Residual sub block).

The best prediction mode would be the one by which the predicted block most closely matches the actual block. In order to accomplish that, the predicted block would have to be generated using all the modes, then compared with the actual block to find out the best mode. This would involve an enormous amount of computation. Also, until the best mode is found, the processing would be stalled, thereby bringing down the throughput of the encoder. Further, no single mode consistently offers better compression than the others; the best mode changes dynamically with the picture being processed. Of these nine modes, the horizontal mode of prediction is aptly suited for fast implementation as an ASIC or an FPGA
consuming minimum hardware. For these reasons, the horizontal mode of intraprediction has been chosen for this implementation.

The basic principle involved in the horizontal mode of intra prediction implemented in this work is shown in Fig. 2. A picture is processed macro block by macro block in the order from top to bottom and from left to right. A macro block consists of 16x16 pixels. These are further divided into 4x4 pixel sub blocks, B0 to B15, as shown in Fig. 2(a). These sub blocks are processed in the order: B0, B1, ..., B3; B4, B5, ..., B15. As an example, B6 is shown as the current sub block that is required to be processed. The pixel values of this sub block are ‘p1’, ‘p2’, ..., ‘p16’. It may be noted that just before this sub block, B5 was already processed.

The shaded part in Fig. 2(b) shows the horizontal mode prediction block for processing the current sub block. As shown therein, the ‘d’, ‘c’, ‘b’ and ‘a’ pixels serve as the prediction for the current sub block B6. These prediction pixels belong to the last (4th) column of the previously reconstructed sub block B5. The pixels ‘e’ to ‘m’ are the already reconstructed last row pixels of the upper sub blocks. However, these pixels are not used in the horizontal prediction. It may be noted that the prediction for the current sub block being processed is always the last column of the most recently reconstructed sub block. In the above example, B5 offers prediction for B6. As another example, the reconstructed last column of B0 forms the prediction for the current sub block B1. As yet another example, the reconstructed last column of B11 forms the prediction for the current sub block B12.

Fig. 2(c) shows the reconstruction of the current sub block (B6 in this example) by processing TQIQIT. For processing the integer transform, the residual values of the sub block (and not the actual pixel values) are taken as the inputs. The residual values are obtained by taking the pixel-wise difference between the current sub block (B6) and the prediction derived from sub block B5. The processing of TQIQIT is explained in detail in subsequent sections. The reconstructed residual pixels of the sub block are obtained after processing of TQIQIT. Subsequently, these reconstructed residual pixels are added to the corresponding prediction sub block pixels to get the reconstructed sub block (B6 being an example). For the first sub block B0 of a macro block, no pixels are available to generate the predicted block. Therefore, the predicted block of such a block has all its pixel values set to “0”, i.e., the block is processed without prediction.

4. ALGORITHM FOR PARALLEL MATRIX MULTIPLICATION OF INTEGER TRANSFORM, QUANTIZATION AND THEIR INVERSES

4.1. Integer Transform and Quantization

A novel parallel algorithm that is capable of being highly pipelined has been developed for computing the integer transform and quantization in an earlier work [15]. The core integer transform is expressed as a two-stage matrix multiplication as shown in Eq. 1. The values X00 to X33 are the residual pixel inputs from the intra-prediction stage contained in matrix X as described in the previous section. C and C’ (the transpose of C) are constant matrices. W, containing elements W00 to W33, is the matrix of coefficients after transforming the matrix X.

$$
\begin{bmatrix} W_{00} & W_{01} & W_{02} & W_{03} \\ W_{10} & W_{11} & W_{12} & W_{13} \\ W_{20} & W_{21} & W_{22} & W_{23} \\ W_{30} & W_{31} & W_{32} & W_{33} \end{bmatrix}
=
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
\begin{bmatrix} X_{00} & X_{01} & X_{02} & X_{03} \\ X_{10} & X_{11} & X_{12} & X_{13} \\ X_{20} & X_{21} & X_{22} & X_{23} \\ X_{30} & X_{31} & X_{32} & X_{33} \end{bmatrix}
\begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix}
$$

Or in short,

$$W = C \, X \, C' \tag{1}$$

Each of the transformed coefficients Wij is quantized by a scalar quantizer as specified in Ref. [1]. A total of 52 values of quantization step size (Qstep) are supported by the standard and these are indexed by a Quantization Parameter, QP. Qstep doubles in size for every increment of 6 in QP. The wide range of quantizer step sizes makes it possible for an encoder to accurately and flexibly control the trade-off between bit rate and quality. The quantized coefficients are computed as:

$$Z_{ij} = \left\lfloor \frac{W_{ij} \cdot MF}{2^{qbits}} \right\rfloor \tag{2}$$

where qbits = 15 + floor(QP/6) and MF is a multiplication factor specified in the H.264 reference model software of the standard. The algorithm for the integer transform and quantization is as follows:

1. Multiply/add the first row of C with each column of X one after another to generate the first row of partial products, P00 to P03. The multiplications involved are trivial since 1, -1, 2, -2 are the multiplying constants.

2. Multiply/add the second row of C with each column of X one after another to generate the second row of partial products, P10 to P13. Concurrently, multiply the first row of partial products P00 to P03 (generated in the previous step) with each of the columns of C’ one after another to generate the first row of integer transformed coefficients. Pipeline the quantization (multiplication with MF) as per Eq. 2 immediately after each integer coefficient Wij is generated. It may be noted that the division by 2^qbits is just a right shift operation, dispensing with a divider. In this step, the quantized coefficients Zij are generated.
3. Repeat step 2 for the third and fourth rows of C to generate the rest of the sixteen quantized coefficients.

4.2. Inverse Quantization and Inverse Integer Transform

The inverse quantization is expressed as

$$W^{i}_{ij} = Z_{ij} \cdot V_{ij} \cdot 2^{\lfloor QP/6 \rfloor} \tag{3}$$

where Zij are the quantized coefficients and Vij are the rescaling factors dependent upon the coefficient position as specified in the H.264 standard. The inverse integer transform that follows the inverse quantization stage is as follows:

$$X = C_i' \, W^{i} \, C_i \tag{4}$$

where

$$
C_i = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \tfrac{1}{2} & -\tfrac{1}{2} & -1 \\ 1 & -1 & -1 & 1 \\ \tfrac{1}{2} & -1 & 1 & -\tfrac{1}{2} \end{bmatrix}
$$

and Ci' is its transpose.

The algorithm for the inverse quantization and the inverse integer transform is similar to that of the forward transform and quantization and is, therefore, not presented here.

5. ARCHITECTURE OF INTRA PREDICTION, INTEGER TRANSFORM AND QUANTIZATION PROCESSORS AND THEIR INVERSES

The basic architecture of the Horizontal mode of Intra Prediction, Integer Transform and Quantization (TQ) and their Inverses (IQIT) of the H.264 Advanced Video Encoder as implemented in the present work is shown in Fig. 3. The video encoder brings about the compression of video signals, which is vital in bringing down the storage and the bandwidth of the serial channel over which the compressed bit stream is transmitted. A video sequence, such as that coming from the output of a camera decoder, is input to the first stage, the format converter, which converts the 4:2:2 format luminance (Y) and chrominance (Cb, Cr) components of a color motion picture to the standard 4:2:0 format. This format has fewer pixels to be processed than the 4:2:2 format, thus resulting in less processing time and more compression.

The 4:2:0 Y, Cb, Cr components are intra-predicted, transformed, quantized, inverse quantized and inverse transformed in order to get the reconstructed picture output. The quantized coefficients are then fed to the Context Adaptive Variable Length Coder (CAVLC) module [16], which assigns variable length codes to get the desired compressed bit stream. These pixel components are valid when the “datain_valid” signal is asserted. The luminance and chrominance components are written into a “dual RAM” at the rising edge of the “write_clk” signal. Thus, the RAMs store two blocks of 16 lines, i.e., two macro block rows. A macro block consists of 16x16 pixels. As one RAM buffer gets filled, the intraprediction is processed concurrently from the other buffer previously filled. The stored data is read from the “dual RAM” for further processing at the rising edge of the “read_clk” signal. The system is reset at the time of powering on using an asynchronous active low signal “reset_n”. Just as a microprocessor may be halted at any point of time, the TQIQIT processing may also be temporarily suspended using the “halt” signal in order to allow the CAVLC processor to catch up with the TQIQIT processor. The desired compression may be set by the quantization parameter “Qstep_in [1:0]”, which is user configurable. After processing, the reconstructed chrominance components “pix_cb_rec_out” and “pix_cr_rec_out” are output along with their corresponding valid signals. “q_coef” is the output after quantization and it is fed to the CAVLC Processor for effecting compression.

Fig 3: Functional Modules of the Advanced Video Encoder Implemented

5.1. Detailed Architecture of the Intra-prediction Module

The intra-prediction module consists of double buffered RAMs for each of the components Y, Cb and Cr. The output of the dual RAM, which stores the current sub block being processed, is the input to the intra-prediction module. This module consists of two sub modules, namely, the “four_pix_out_y” module for storing a current sub block and the “ram_predict_y” module for storing the predicted pixel values and the reconstructed pixel values needed to generate the predicted block, as shown in Fig. 4. The current sub block pixel values from the “dual RAM” module are input to the module “four_pix_out_y” using the data bus “pix_y_dram_in [7:0]”. Their validity is signaled by simultaneously asserting the signal “pix_y_dram_val”.
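
As an illustration only, the data flow of the horizontal mode of intra prediction described in Section 3 can be sketched in software. The following Python sketch is not the Verilog design of this work; the function names and the `tqiqit` callback (which stands in for the TQIQIT processor and defaults to a lossless pass-through) are assumptions introduced for the example.

```python
def residual_block(block, pred_col):
    """Subtract the horizontal prediction from a 4x4 block.

    block    : 4x4 list of rows holding the current sub block pixels
    pred_col : 4 reconstructed pixels (last column of the previous sub block);
               every pixel in row r is predicted by pred_col[r]
    """
    return [[block[r][c] - pred_col[r] for c in range(4)] for r in range(4)]


def reconstruct_block(residual_rec, pred_col):
    """Add the prediction back to the reconstructed residual."""
    return [[residual_rec[r][c] + pred_col[r] for c in range(4)] for r in range(4)]


def last_column(block):
    """Return the 4th column of a 4x4 block (prediction for the next sub block)."""
    return [row[3] for row in block]


def encode_sub_blocks(sub_blocks, tqiqit=lambda res: res):
    """Process sub blocks in order, chaining the prediction as in Section 3.

    The first sub block is processed with an all-zero prediction, i.e.
    without prediction. `tqiqit` is a hypothetical stand-in for the
    transform/quantization round trip; the default is lossless.
    """
    pred_col = [0, 0, 0, 0]
    reconstructed = []
    for block in sub_blocks:
        res = residual_block(block, pred_col)
        rec = reconstruct_block(tqiqit(res), pred_col)
        reconstructed.append(rec)
        pred_col = last_column(rec)  # serves as prediction for the next sub block
    return reconstructed
```

With the default lossless `tqiqit`, the reconstructed sub blocks equal the inputs; with a lossy round trip, the prediction is taken from the reconstructed (not the original) pixels, as in the text.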
The “four_pix_out_y” module outputs the current pixel values, one column at a time, at the pins marked “pix_fpo0” to “pix_fpo3”, with the signal “pix_fpo_valid” serving as their valid signal. The entire sub block is output in 4 clock cycles. The pixels output in these clock cycles are p1, p5, p9, p13 in the first cycle, p2, p6, p10, p14 in the second cycle, p3, p7, p11, p15 in the third cycle and p4, p8, p12, p16 in the last cycle. These are the current sub block pixel values presented in Fig. 2 earlier.

Fig 4: Architecture of Intra-prediction Module

The module “ram_predict_y” contains the predicted values, i.e., the last column of the reconstructed pixels (d, c, b, a) of the previously processed sub block. The TQIQIT module computes the reconstructed residual values “pix_y_res_rec”, which are added to the previously reconstructed pixels (d, c, b, a) in this “ram_predict_y” module. The result is the reconstructed value (pix_y_rec_out) of the current sub block. The last column pixel values of this reconstructed sub block are also output as “pix_pred0” to “pix_pred3” with “pix_pred_valid” as the valid signal. These are internally stored as (d, c, b, a) to serve as the predicted values for processing the next sub block.

In the next module called “intrapred_mem_y”, the difference between the current sub block pixel values “pix_fpo0” to “pix_fpo3” and the predicted values gives the residual values “pix0_y_res” to “pix3_y_res” with the “pix_y_res_valid” signal asserted. These four values are fed to the TQIQIT module to get back the reconstructed residual values (pix_y_res_rec). The reconstructed residual values in turn are added to the predicted values (d, c, b, a) to get the reconstructed sub block as described for the “ram_predict_y” module. The signal “pix_y_req_dram_out” is the pixel request to the dual RAM. When this signal is high, the dual RAM outputs pixels to the intra-predict module.

5.2. Detailed Architecture of TQIQIT Processor

The TQIQIT module consists of transformation, quantization, inverse quantization, dual RAM and inverse transformation modules. These modules are shown in Fig. 5. The four residual pixel values “pix0_res” to “pix3_res” after intra-prediction are fed to the transformation module marked “transform”. These pixel values are valid when the “pixel_valid” signal is asserted. The output of the transform module is the transformed coefficient “t_coef” and the validity of the data is asserted by the signal “t_coef_val”.

Fig 5: Architecture of TQIQIT Processor
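
To make the TQIQIT chain concrete, Eqs. 1 to 4 can be exercised in a small software model. The Python sketch below is illustrative and is not the pipelined Verilog architecture of this work. Several items are assumptions not given in the text: the MF and V tables are the values commonly published for H.264 (indexed by QP mod 6 and coefficient position), the function names are invented for the example, and the final rounding division by 64 after the inverse transform follows the usual H.264 formulation rather than anything stated above. Quantization applies Eq. 2 literally as a floor via a right shift.

```python
from fractions import Fraction
from math import floor

# Forward transform matrix C (Eq. 1) and inverse matrix Ci (Eq. 4).
C = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
H = Fraction(1, 2)
CI = [[1, 1, 1, 1], [1, H, -H, -1], [1, -1, -1, 1], [H, -1, 1, -H]]

# Assumed tables (commonly published H.264 values), indexed by [QP % 6][class]:
# class 0: both indices even, class 1: both indices odd, class 2: otherwise.
MF = [(13107, 5243, 8066), (11916, 4660, 7490), (10082, 4194, 6554),
      (9362, 3647, 5825), (8192, 3355, 5243), (7282, 2893, 4559)]
V = [(10, 16, 13), (11, 18, 14), (13, 20, 16),
     (14, 23, 18), (16, 25, 20), (18, 29, 23)]


def matmul(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def pos_class(i, j):
    """Coefficient position class used to select MF and V."""
    if i % 2 == 0 and j % 2 == 0:
        return 0
    if i % 2 == 1 and j % 2 == 1:
        return 1
    return 2


def forward_transform(x):
    """Eq. 1: W = C * X * C'."""
    return matmul(matmul(C, x), transpose(C))


def quantize(w, qp):
    """Eq. 2: Z = floor(W * MF / 2^qbits), done as an arithmetic right shift."""
    qbits = 15 + qp // 6
    return [[(w[i][j] * MF[qp % 6][pos_class(i, j)]) >> qbits
             for j in range(4)] for i in range(4)]


def dequantize(z, qp):
    """Eq. 3: Wi = Z * V * 2^floor(QP/6), done as a left shift."""
    return [[(z[i][j] * V[qp % 6][pos_class(i, j)]) << (qp // 6)
             for j in range(4)] for i in range(4)]


def inverse_transform(wi):
    """Eq. 4: X = Ci' * Wi * Ci, followed by the assumed rounding division by 64."""
    x = matmul(matmul(transpose(CI), wi), CI)
    return [[floor((v + 32) / 64) for v in row] for row in x]
```

For example, a constant residual block of value 5 at QP = 28 transforms to a single DC coefficient of 80, which quantizes to 1 and reconstructs to a constant block of 4 under the assumptions above.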
The transformed coefficients are fed to the quantizer. The quantization is performed according to Eq. 2. The signal “q_rsh” is the external input to the quantizer module used to decide the desired compression with the help of Q_step. The quantization is performed by a right shift operation. The output of the quantizer is the quantized coefficient “q_coef” and the validity of the data is asserted by the signal “q_coef_val”. These outputs are fed as inputs to the next processing module, CAVLC, which is not part of this work.

After quantization, the coefficients are fed to the inverse quantization module. The signal “q_lsh” is used as the inverse quantization parameter. The inverse quantization is performed by a left shift operation. The output of the inverse quantizer is the signal “inv_coef” and the validity of the data is asserted by the signal “inv_coef_val”. The output of the inverse quantizer is fed to the dual RAM module “dram_inter” to get the four coefficients “inv_coef0” to “inv_coef3”, and the validity of these coefficients is asserted by the signal “inv_coef_val”. These four coefficients are fed to the inverse transform module and the output of this module is the reconstructed residual values “pix_res_rec”, whose validity is asserted by the signal “pix_res_rec_val”. The inverse transformation is the reverse process of transformation, just as inverse quantization is the reverse of the quantization process.

The architectures for intraprediction and TQIQIT for chrominance (Cb and Cr) are similar to that of luminance and hence not presented in this paper.

6. SIMULATION RESULTS AND DISCUSSIONS

The various modules described in previous sections were coded in Verilog, the top design being called “top_tqiqit”. There are several other sub modules instantiated by this top design module. A test bench was also developed so that the design may be tested using ModelSim. A Matlab program was first written that accepts a standard true color picture in 4:4:4 “tif” format as input and converts it to luminance (Y) and chrominance (C) components in 4:2:2 “tif” format. These “tif” files are converted to “raw” format using standard software such as IrfanView. These files are in raster scan order and they need to be converted to Macro block/Sub block order before they can be used in ModelSim for simulation. Therefore, a C++ program was written to convert the raw picture into Sub blocks as a “txt” file, which serves as the input to the Verilog module TQIQIT. The Verilog design “top_tqiqit”, whose architecture was presented earlier, was run in ModelSim to get the reconstructed picture in 4:2:0 “txt” format. These reconstructed “txt” files (Y, Cb, Cr) were converted back to “raw” format using another C++ program. The result is still in Macro block/Sub block order. It is finally converted to “tif” format using another Matlab program. This program automatically displays both the original picture and the reconstructed picture. The Matlab program also computes the quality of the reconstructed picture, referred to as PSNR and expressed in dB.

The simulated waveforms are shown in Fig. 6 and 7. The reconstructed picture is generated at every rising edge of “read_clk” with latency coming into play for every sub block processed. The first Sub block reconstructed pixel values “pix_Y_rec_out”, “pix_Cb_rec_out” and “pix_Cr_rec_out” and their corresponding valid signals are shown in Fig. 6 and 7. From these waveforms, we observe that the reconstructed data “pix_Y_rec_out” commences at 98445 ns and ends at 1638311 ns. Similarly, the reconstructed data “pix_Cb_rec_out” starts at 99821 ns and ends at 1638655 ns, while “pix_Cr_rec_out” commences at 100165 ns and ends at 1638999 ns. Some of these start/end time waveforms are not presented here since they occupy a lot of space.

In summary, the reconstructed picture pixels start issuing at 98445 ns (Fig. 6) and end at 1638999 ns (Fig. 7), thus taking 1540554 ns for processing a complete frame of a video sequence. Since each “read_clk” cycle is of duration 2 ns during simulation, it takes 770277 “read_clk” cycles to reconstruct the entire data. Therefore, for a picture of size 512x256 pixels such as Lena used in the present simulation, it takes 5.9 clock cycles to process each pixel. Assuming an operating frequency of 124 MHz for “read_clk”, this works out to 6.24 ms per frame ignoring latency, which is small. This assumption is valid since the Verilog design works at 124 MHz, as presented in the next section on FPGA implementation. Extrapolating this processing time to a picture of resolution 1024x768 pixels, we get a processing time of 37.4 ms per frame, which supports a frame rate of 25 pictures per second.

The H.264 video encoder was first implemented in Matlab in order to estimate the quality of the reconstructed image and the compression that can be achieved. In addition, the Matlab output serves as a reference for verifying the Verilog output. Subsequently, the core modules of the encoder as described earlier were realized using Verilog for ASIC/FPGA implementation. The resulting qualities of the reconstructed images obtained with intra-prediction using Matlab and Verilog compare favorably, as can be seen from Fig. 8. It may be seen from Fig. 8(b) and 8(c) that the Verilog result is very close to the Matlab result since the Verilog codes use at least 16-bit precision.
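
The frame-time estimate given earlier in this section can be reproduced with a few lines of arithmetic. The Python sketch below uses only the figures quoted in the text (start and end times, the 2 ns simulation clock period, the 124 MHz operating frequency and the 512x256 test picture); the function and constant names are assumptions made for the example.

```python
START_NS = 98445          # first reconstructed pixel (Fig. 6)
END_NS = 1638999          # last reconstructed pixel (Fig. 7)
SIM_CLK_PERIOD_NS = 2     # "read_clk" period used in simulation
FPGA_CLK_HZ = 124e6       # maximum "read_clk" frequency on the FPGA


def frame_timing(width, height, ref_width=512, ref_height=256):
    """Scale the simulated cycle count to a picture of width x height pixels.

    Returns (simulated cycles, cycles per pixel, frame time in ms, frames/s).
    """
    cycles = (END_NS - START_NS) // SIM_CLK_PERIOD_NS
    cycles_per_pixel = cycles / (ref_width * ref_height)
    frame_cycles = cycles_per_pixel * width * height
    frame_ms = frame_cycles / FPGA_CLK_HZ * 1e3
    return cycles, cycles_per_pixel, frame_ms, 1e3 / frame_ms


if __name__ == "__main__":
    for size in ((512, 256), (1024, 768)):
        cycles, cpp, ms, fps = frame_timing(*size)
        print(size, cycles, round(cpp, 2), round(ms, 2), round(fps, 1))
```

With the unrounded cycle count, the frame time comes to about 6.21 ms for 512x256 and about 37.3 ms for 1024x768; the 6.24 ms and 37.4 ms quoted above follow from using the rounded value of 5.9 cycles per pixel. Either way the result is above 25 frames per second.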
Fig 6: Reconstructed Picture Waveforms of First Sub Block

Fig 7: Reconstructed Picture Waveforms of Last Sub Block

Fig 8: Simulation Results of H.264 TQIQIT Processor
(a) Original Lena Image (512x512 pixels)
(b) Reconstructed Lena Image using Matlab with Intraprediction (PSNR: 35.5 dB)
(c) Reconstructed Lena Image using Verilog with Intraprediction (PSNR: 35.2 dB)

In the simulation results presented earlier, a Lena image with a resolution of 512x256 pixels was used. However, for reconstruction, we use 512x512 pixels. The results were obtained for Qstep_in = 16 (QP = 28). The compression achieved for 4:2:0 format was 15 with intra-prediction and 11.7 without intra-prediction using Matlab, thus revealing a substantial improvement in compression for the horizontal mode of intra-prediction. The Verilog result without intra-prediction, however, offered a higher PSNR value, namely, 37.3 dB. The proposed FPGA implementation, which processes motion pictures of size 1024x768 pixels at 25 frames/sec, is about two times faster than the SOC/ASIC implementation reported by Qiang Peng et al. [6].

7. FPGA IMPLEMENTATION

The various modules described in previous sections were coded in Verilog, simulated using ModelSim, synthesized using Synplify Pro 8.5 and placed and routed using Xilinx Project Navigator ISE 8.2. The target device chosen was the Xilinx Virtex-II Pro XUPVP30 -7 FF896 FPGA since the board available in our laboratory is based on this FPGA. The core parts of the video encoder design described in previous sections utilize 863,469 gates and 12 block RAMs with 1666 occupied slices. The maximum frequency of operation is 124 MHz for “read_clk”. This works out to a frame rate of 25 per second for a picture size of 1024x768 pixels as explained earlier. With a higher speed FPGA, the frame rate can be increased to 30. The Verilog code developed for this project is fully RTL compliant and technology independent. As a result, it can work on any FPGA or ASIC without needing any code changes. As an ASIC, it is likely to work for higher resolutions up to 1600x1200 pixels at 30 frames/sec.

8. CONCLUSION

An FPGA implementation of the core processors of the H.264 Video Codec has been presented. It uses the horizontal mode of intraprediction. While intraprediction was found to give higher compression, the gains obtained vary with the video sequence. Although the 4x4 integer transform used is significantly simpler and faster than the 8x8 DCT used in MPEG-2, the gain in processing speed is offset by intraprediction and latency. A significant improvement over MPEG-2 is the reduction of blocking artifacts, especially for high compression, even without using a de-blocking filter. The desired compression can be selected by the user in the implemented H.264 encoder modules. The FPGA implementation of the present work produces high quality reconstructed pictures and compares favorably with another implementation.

REFERENCES

[1] Joint Video Team, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec H.264 and ISO/IEC 14496 AVC, March 2005.
[2] “Advanced Video Coding for Generic Audio-visual Services”, ITU-T H.264.
[3] “Generic Coding of Moving Pictures and Associated Audio”, ISO/IEC JTC1 CD 13818, 1994.
[4] Sadiqullah Khan, Gulistan Raja, “Integer Cosine Transform and its Application in Image/Video Compression”, ICSEA-2004 Conference Proceedings, Islamabad, pp. 189-193, December 2004.
[5] Liu Ling-zhi, Qiu Lin, Rong Meng-tian, Jiang Li, “A 2-D Forward/inverse Integer Transform Processor of H.264 based on Highly Parallel Architecture,” Proceedings of the 4th IEEE International Workshop on System-on-Chip for Real-Time Applications (IWSOC’04), 2004.
[6] Qiang Peng and Jin Jing, “H.264 System on Chip Design and Verification”, The IEEE 2003 Workshop on Signal Processing Systems (SIPS’03), 2003.
[7] I. E. G Richardson, “H.264 and MPEG-4 Video Compression (Video Coding for Next Generation Multimedia)”, John Wiley, January 2004.
[8] “MPEG-4 Overview”, ISO/IEC JTC 1/SC29/WG11 N4668.
[9] D. LeGall, “MPEG: A video Compression Standard for Multimedia Application,” Communication, ACM, 34, pp. 46-58, Apr. 1991.
[10] F. Pan, “Fast Intra Mode Decision Algorithm for H.264/AVC Video Coding,” Proceedings, International Conference, Image Processing (ICIP), 2, pp. 781-784, Oct. 2004.
[11] W. K. Cham, “Development of Integer Cosine Transforms by the Principle of Dyadic Symmetry,” IEE Proceedings, 136, pt. I, No. 4, pp. 276-282, August, 1989.
[12] K. M. Cheung, F. Pollara, and M. Shahshahani, “Integer Cosine Transform for Image Compression,” The Telecommunications and Data Acquisition Progress Report 42-105, Vol. January-March 1991, Jet Propulsion Laboratory, Pasadena, California, pp. 45-60, May 15, 1991.
[13] Thomas Wiegand and Gary J. Sullivan, “Overview of the H.264/AVC Video Coding Standard,” IEEE Transactions On Circuits and Systems For Video Technology, pp. 1-17, July 2003.
[14] J. Ribas-Corbera, P. A. Chou, and S. Regunathan: “A Generalized Hypothetical Reference Decoder for H.264/AVC,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 18-32, July 2003.
[15] Keshaveni. N, Ramachandran. S, K.S. Gurumurthy: “Design and Implementation of Integer Transform and Quantization Processor for H.264 Encoder on FPGA”, International Conference on Advances in Computing, Control and Telecommunication Technologies, December 2009.
[16] Keshaveni. N, Ramachandran. S, K.S. Gurumurthy: “Implementation of Context Adaptive Variable Length Coder for H.264 Video Encoder”, International Journal of Recent Trends in Engineering [ISSN: 1797-9617] by the Academy Publishers, Finland.