Design and Implementation of Integer Transform and Quantization Processor For H.264 Encoder On FPGA
All content following this page was uploaded by N. Keshaveni on 21 July 2015.
ABSTRACT
This paper proposes a novel implementation of the core processors, intra prediction, the integer transform, quantization,
inverse quantization and inverse transformation for H.264 Video Encoder using an FPGA. It is capable of processing
video frames with the desired compression controlled by the user input. The algorithm and architecture of the core
modules of the video encoder namely, horizontal mode of intra prediction, the integer transform, quantization, inverse
quantization and inverse transformation were developed, designed and coded in Verilog. The complete H.264
Advanced Video Codec was coded in Matlab in order to verify the results of the Verilog implementation. The processor
is implemented on a Xilinx Virtex-II Pro XUPVP30 FPGA. The gate count of the implementation is approximately
870,000. It can process 1024x768 pixels moving color pictures in 4:2:0 format at 25 frames per second. The reconstructed
picture quality is better than 35 dB.
Keywords: Intraprediction, Integer transform, quantization, Inverse quantization, Inverse transform, Codec, Verilog,
FPGA.
In the present work, the core processors such as the intra prediction, integer transform, quantization, inverse quantization and inverse transform (TQIQIT) were implemented. They are shown shaded in the figure. The design of the remaining modules is involved and their development is in progress.

3. ARCHITECTURAL DETAILS OF HORIZONTAL MODE OF INTRAPREDICTION

Within a picture frame, pixels close to each other tend to have similar values. Intraprediction is done in order to exploit the spatial redundancy within a frame. Each pixel is predicted based on the values of its neighboring pixels that are available. Instead of processing the pixel value, only the difference between the actual value of the pixel and its predicted value, known as the residual pixel, is processed. If a block or macro block is encoded in intra

b. Horizontal Mode Prediction block (shaded part) for processing the current sub block;
c. TQIQIT Processing (Reconstruction of Residual sub block).

The best prediction mode would be the one by which the predicted block most closely matches the actual block. In order to accomplish that, the predicted block would have to be generated using all the modes, then compared with the actual block to find out the best mode. This would involve an enormous amount of computation. Also, until the best mode is found, the processing would be stalled, thereby bringing down the throughput of the encoder. Further, no particular mode consistently offers better compression than the others; the best mode changes dynamically with the picture being processed. Of these nine modes, the horizontal mode of prediction is aptly suited for fast implementation as an ASIC or an FPGA
Design and FPGA Implementation of Integer Transform and Quantization Processor... 45
consuming minimum hardware. For these reasons, the horizontal mode of intraprediction has been chosen for this implementation.

The basic principle involved in the horizontal mode of intra prediction implemented in this work is shown in Fig. 2. A picture is processed macro block by macro block in the order from top to bottom and from left to right. A macro block consists of 16x16 pixels. These are further divided into 4x4 pixel sub blocks, B0 to B15, as shown in Fig. 2(a). These sub blocks are processed in the order: B0, B1, ..., B3; B4, B5, ..., B15. As an example, B6 is shown as the current sub block that is required to be processed. The pixel values of this sub block are ‘p1’, ‘p2’, ..., ‘p16’. It may be noted that just before this sub block, B5 was already processed.

The shaded part in Fig. 2(b) shows the horizontal mode prediction block for processing the current sub block. As shown therein, the ‘d’, ‘c’, ‘b’ and ‘a’ pixels serve as the prediction for the current sub block B6. These prediction pixels belong to the last (4th) column of the previously reconstructed sub block B5. The pixels ‘e’ to ‘m’ are the already reconstructed last-row pixels of the upper sub blocks. However, these pixels are not used in the horizontal prediction. It may be noted that the prediction for the current sub block being processed is always the last column of the most recently reconstructed sub block. In the above example, B5 offers the prediction for B6. As another example, the reconstructed last column of B0 forms the prediction for the current sub block B1. As yet another example, the reconstructed last column of B11 forms the prediction for the current sub block B12.

Fig. 2(c) shows the reconstruction of the current sub block (B6 in this example) by processing TQIQIT. For processing the integer transform, the residual values of the sub block (and not the actual pixel values) are taken as the inputs. The residual values are obtained by taking the pixel-wise difference between the current sub block (B6) and the prediction sub block (B5). The processing of TQIQIT is explained in detail in subsequent sections. The reconstructed residual pixels of the sub block are obtained after processing of TQIQIT. Subsequently, these reconstructed residual pixels are added to the corresponding prediction sub block pixels to get the reconstructed sub block (B6 being an example). For the first sub block B0 of a macro block, no pixels are available to generate the predicted block. Therefore, the predicted block of such a block has all its pixel values set to “0”, i.e., the block is processed without prediction.

4. ALGORITHM FOR PARALLEL MATRIX MULTIPLICATION OF INTEGER TRANSFORM, QUANTIZATION AND THEIR INVERSES

4.1. Integer Transform and Quantization

A novel parallel algorithm that is capable of being highly pipelined has been developed for computing the integer transform and quantization in an earlier work [15]. The core integer transform is expressed as a two-stage matrix multiplication as shown in Eq. 1. The values X00 to X33 are the residual pixel inputs from the intra-prediction stage contained in matrix X as described in the previous section. C and C’ (the transpose of C) are constant matrices. W, containing elements W00 to W33, is the matrix of coefficients after transforming the matrix X.

$$
\begin{bmatrix} W_{00} & W_{01} & W_{02} & W_{03} \\ W_{10} & W_{11} & W_{12} & W_{13} \\ W_{20} & W_{21} & W_{22} & W_{23} \\ W_{30} & W_{31} & W_{32} & W_{33} \end{bmatrix}
=
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
\begin{bmatrix} X_{00} & X_{01} & X_{02} & X_{03} \\ X_{10} & X_{11} & X_{12} & X_{13} \\ X_{20} & X_{21} & X_{22} & X_{23} \\ X_{30} & X_{31} & X_{32} & X_{33} \end{bmatrix}
\begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix}
$$

Or in short,

$$ W = C \, X \, C' \tag{1} $$

Each of the transformed coefficients Wij is quantized by a scalar quantizer as specified in Ref. [1]. A total of 52 values of quantization step size (Qstep) are supported by the standard and these are indexed by a Quantization Parameter, QP. Qstep doubles in size for every increment of 6 in QP. The wide range of quantizer step sizes makes it possible for an encoder to accurately and flexibly control the trade-off between bit rate and quality. The quantized coefficients are computed as:

$$ Z_{ij} = \left\lfloor \frac{W_{ij} \cdot MF}{2^{qbits}} \right\rfloor \tag{2} $$

where qbits = 15 + floor(QP/6) and MF is a multiplication factor specified in the H.264 reference model software of the standard. The algorithm for the integer transform and quantization is as follows:

1. Multiply/add the first row of C with each column of X one after another to generate the first row of partial products, P00 to P03. Multiplications involved are trivial since 1, -1, 2, -2 are the multiplying constants.

2. Multiply/add the second row of C with each column of X one after another to generate the second row of partial products, P10 to P13. Concurrently multiply the first row of partial products P00 to P03 (generated in the previous step) with each of the columns of C’ one after another to generate the first row of integer transformed coefficients. Pipeline the quantization (multiplication with MF) as per Eq. 2 immediately after each integer coefficient Wij is generated. It may be noted that the division by 2^qbits is just a right shift operation, dispensing with a divider. In this step, the quantized coefficients Zij are generated.
46 International Journal of Computer Science & Communication (IJCSC)
3. Repeat step 2 for the third and fourth rows of C to generate the rest of the sixteen quantized coefficients.

4.2. Inverse Quantization and Inverse Integer Transform

The inverse quantization is expressed as

$$ W^{i}_{ij} = Z_{ij} \cdot V_{ij} \cdot 2^{\lfloor QP/6 \rfloor} \tag{3} $$

where Zij are the quantized coefficients and Vij are the rescaling factors dependent upon the coefficient position as specified in the H.264 standard. The inverse integer transform that follows the inverse quantization stage is as follows:

$$ X = C_i' \, W^{i} \, C_i \tag{4} $$

where

$$
C_i = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \tfrac{1}{2} & -\tfrac{1}{2} & -1 \\ 1 & -1 & -1 & 1 \\ \tfrac{1}{2} & -1 & 1 & -\tfrac{1}{2} \end{bmatrix}
$$

and Ci' is its transpose.

The algorithm for the inverse quantization and the inverse integer transform is similar to that of the forward transform and quantization and is, therefore, not presented here.

components are valid when the “datain_valid” signal is asserted. The luminance and chrominance components are written into a “dual RAM” at the rising edge of the “write_clk” signal. Thus, the RAMs store two blocks of 16 lines, i.e., two macro block rows. A macro block consists of 16x16 pixels. As one RAM buffer gets filled, the intraprediction is processed concurrently from the other buffer previously filled. The stored data is read from the “dual RAM” for further processing at the rising edge of the “read_clk” signal. The system is reset at the time of powering on using an asynchronous active low signal “reset_n”. Just as a microprocessor may be halted at any point of time, the TQIQIT processing may also be temporarily suspended using the “halt” signal in order to allow the CAVLC processor to catch up with the TQIQIT processor. The desired compression may be set by the quantization parameter “Qstep_in [1:0]”, which is user configurable. After processing the desired data, the chrominance components “pix_cb_rec_out” and “pix_cr_rec_out” are output along with their corresponding valid signals. “q_coef” is the output after quantization and it is fed to the CAVLC Processor for effecting compression.
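Equations 1 to 4 of Section 4 can be exercised end to end in software. The following is a minimal Python sketch under stated assumptions, not the pipelined hardware algorithm of this paper. The function names are hypothetical. MF_TABLE and V_TABLE hold the multiplication and rescaling factors commonly tabulated for H.264, indexed by QP mod 6 and by coefficient position; they are not listed in this paper and are an assumption here. The final division by 64 with rounding after the inverse transform follows the usual H.264 scaling and is likewise not stated in Eq. 4. Exact fractions stand in for the 1/2 entries of Ci, where hardware would use shifts.

```python
from fractions import Fraction
from math import floor

# Forward transform matrix C of Eq. 1.
C = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]

# Inverse transform matrix Ci of Eq. 4, kept exact with fractions.
HALF = Fraction(1, 2)
CI = [[Fraction(v) for v in row] for row in (
    [1, 1, 1, 1],
    [1, HALF, -HALF, -1],
    [1, -1, -1, 1],
    [HALF, -1, 1, -HALF],
)]

# ASSUMPTION: factors as commonly tabulated for H.264, one row per QP % 6.
# Columns: both indices even, both indices odd, all other positions.
MF_TABLE = [(13107, 5243, 8066), (11916, 4660, 7490), (10082, 4194, 6554),
            (9362, 3647, 5825), (8192, 3355, 5243), (7282, 2893, 4559)]
V_TABLE = [(10, 16, 13), (11, 18, 14), (13, 20, 16),
           (14, 23, 18), (16, 25, 20), (18, 29, 23)]


def matmul(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def pos_class(i, j):
    """Select the table column from the coefficient position (i, j)."""
    if i % 2 == 0 and j % 2 == 0:
        return 0
    if i % 2 == 1 and j % 2 == 1:
        return 1
    return 2


def forward_transform(x):
    """Eq. 1: W = C * X * C'."""
    return matmul(matmul(C, x), transpose(C))


def quantize(w, qp):
    """Eq. 2: Z = floor(W * MF / 2^qbits), done as an arithmetic right shift."""
    qbits = 15 + qp // 6
    mf = MF_TABLE[qp % 6]
    return [[(w[i][j] * mf[pos_class(i, j)]) >> qbits for j in range(4)]
            for i in range(4)]


def dequantize(z, qp):
    """Eq. 3: Wi = Z * V * 2^floor(QP/6), done as a left shift."""
    v = V_TABLE[qp % 6]
    return [[(z[i][j] * v[pos_class(i, j)]) << (qp // 6) for j in range(4)]
            for i in range(4)]


def inverse_transform(wi):
    """Eq. 4: X = Ci' * Wi * Ci, then ASSUMED rounding division by 64."""
    x = matmul(matmul(transpose(CI), wi), CI)
    return [[floor((v + 32) / 64) for v in row] for row in x]


# A flat residual block of value 10, quantized at QP = 6.
flat = [[10] * 4 for _ in range(4)]
z = quantize(forward_transform(flat), 6)
rec = inverse_transform(dequantize(z, 6))
print(z[0][0], rec[0])
```

For a flat block only the DC coefficient W00 is non-zero, which makes the round trip easy to follow by hand; textured blocks will generally show a small reconstruction error that grows with QP.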
The “four_pix_out_y” module outputs the current pixel values, one column of pixel values at a time, at the pins marked “pix_fpo0” to “pix_fpo3”, with the signal “pix_fpo_valid” serving as their valid signal. The entire sub block is output in 4 clock cycles. The pixels output in these clock cycles are p1, p5, p9, p13 in the first cycle, p2, p6, p10, p14 in the second cycle, p3, p7, p11, p15 in the third cycle and p4, p8, p12, p16 in the last cycle. These are the current sub block pixel values presented in Fig. 2 earlier.

In the next module, called “intrapred_mem_y”, the difference between the current sub block pixel values “pix_fpo0” to “pix_fpo3” and the predicted values gives the residual values “pix0_y_res” to “pix3_y_res” with the “pix_y_res_valid” signal asserted. These (four) values are fed to the TQIQIT module to get back the reconstructed residual values (pix_y_res_rec). The reconstructed residual values in turn are added to the predicted values (d, c, b, a) to get the reconstructed sub block as described in the “ram_predict_y” module. The signal “pix_y_req_dram_out” is the pixel request to the dual RAM. When this signal is high, the dual RAM outputs pixels to the intra-predict module.
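The data flow through “four_pix_out_y” and “intrapred_mem_y” can be sketched in software as follows. This is a minimal Python sketch, not the Verilog of this work: the function names are hypothetical, the sub block is assumed to be numbered p1 to p16 in row-major order (which is what the column sequence p1, p5, p9, p13 implies), and left_col[r] is assumed to hold the reconstructed pixel immediately to the left of row r. The TQIQIT stage is replaced by an identity pass so that only the prediction arithmetic is shown.

```python
def columns(block):
    """Yield one column of a 4x4 sub block per 'clock cycle'."""
    for c in range(4):
        yield [block[r][c] for r in range(4)]


def horizontal_prediction(left_col):
    """Replicate the last column of the previously reconstructed sub block
    across each row. With no neighbor (left_col is None) predict all zeros."""
    if left_col is None:
        left_col = [0, 0, 0, 0]
    return [[left_col[r]] * 4 for r in range(4)]


def residual(cur, pred):
    """Pixel-wise difference: current sub block minus prediction."""
    return [[cur[r][c] - pred[r][c] for c in range(4)] for r in range(4)]


def reconstruct(pred, res_rec):
    """Add the reconstructed residual back to the prediction."""
    return [[pred[r][c] + res_rec[r][c] for c in range(4)] for r in range(4)]


# Current sub block with p1..p16 in row-major order (illustrative values).
cur = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
left_col = [10, 20, 30, 40]  # hypothetical last column of the left sub block

pred = horizontal_prediction(left_col)
res = residual(cur, pred)
rec = reconstruct(pred, res)  # identity stands in for the lossy TQIQIT path

for cycle, col in enumerate(columns(cur), start=1):
    print("cycle", cycle, col)
```

With a real TQIQIT stage in the loop the reconstructed residual differs slightly from the residual, so the reconstructed sub block only approximates the current one; that reconstructed block, not the original, supplies the prediction column for the next sub block.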
The transformed coefficients are fed to the quantizer. The quantization is performed according to Eq. 2. The signal “q_rsh” is the external input to the quantizer module used to decide the desired compression with the help of Q_step. The quantization is performed by a right shift operation. The output of the quantizer is the quantized coefficient “q_coef” and the validity of the data is asserted by the signal “q_coef_val”. These outputs are fed as inputs to the next processing module, CAVLC, which is not part of this work.

After quantization, the coefficients are fed to the inverse quantization module. The signal “q_lsh” is used as the inverse quantization parameter. The inverse quantization is performed by a left shift operation. The output of the inverse quantizer is the signal “inv_coef” and the validity of the data is asserted by the signal “inv_coef_val”. The output of the inverse quantizer is fed to the dual RAM module “dram_inter” to get the four coefficients “inv_coef0” to “inv_coef3”, and the validity of these coefficients is asserted by the signal “inv_coef_val”. These four coefficients are fed to the inverse transform module and the output of this module is the reconstructed residual values “pix_res_rec”, whose validity is asserted by the signal “pix_res_rec_val”. The inverse transformation is the reverse process of transformation just as inverse quantization is the reverse of the quantization process.

The architectures for intraprediction and TQIQIT for chrominance (Cb and Cr) are similar to that of luminance and hence not presented in this paper.

6. SIMULATION RESULTS AND DISCUSSIONS

The various modules described in previous sections were coded in Verilog, the top design being called “top_tqiqit”. There are several other sub modules instantiated by this top design module. Also, a test bench was developed so that the design may be tested using ModelSim. A Matlab program was first written that accepts a standard true color picture in 4:4:4 “tif” format as input and converts it to luminance (Y) and chrominance (C) components in 4:2:2 “tif” format. These “tif” files are converted to “raw” format using standard software such as IrfanView. These files are in raster scan order and they need to be converted to Macro block/Sub block order before they can be used in ModelSim for simulation. Therefore, a C++ program was written to convert the raw picture into Sub blocks as a “txt” file, which serves as the input to the Verilog module TQIQIT. The Verilog design “top_tqiqit”, whose architecture was presented earlier, was run in ModelSim to get the reconstructed picture in 4:2:0 “txt” format. These reconstructed “txt” files (Y, Cb, Cr) were converted back to “raw” format using another C++ program. This is still in Macro block/Sub block order. This is finally converted to “tif” format using another Matlab program. This program automatically displays both the original picture and the reconstructed picture. The Matlab program also computes the quality of the reconstructed picture, referred to as PSNR and expressed in dB.

The simulated waveforms are shown in Fig. 6 and 7. The reconstructed picture is generated at every rising edge of “read_clk”, with latency coming into play for every sub block processed. The first Sub block reconstructed pixel values “pix_Y_rec_out”, “pix_Cb_rec_out” and “pix_Cr_rec_out” and their corresponding valid signals are shown in Fig. 6 and 7. From these waveforms, we observe that the reconstructed data “pix_Y_rec_out” commences at 98445 ns and ends at 1638311 ns. Similarly, the reconstructed data “pix_Cb_rec_out” starts at 99821 ns and ends at 1638655 ns, while “pix_Cr_rec_out” commences at 100165 ns and ends at 1638999 ns. Some of these start/end time waveforms are not presented here since they occupy a lot of space.

In summary, the reconstructed picture pixels start issuing at 98445 ns (Fig. 6) and end at 1638999 ns (Fig. 7), thus taking 1540554 ns for processing a complete frame of a video sequence. Since each “read_clk” cycle is of duration 2 ns during simulation, it takes 770277 “read_clk” cycles to reconstruct the entire data. Therefore, for a picture of size 512x256 pixels such as Lena used in the present simulation, it takes 5.9 clock cycles to process each pixel. Assuming an operating frequency of 124 MHz for “read_clk”, this works out to 6.24 ms per frame ignoring latency, which is small. This assumption is valid since the Verilog design works at 124 MHz, as presented in the next section, FPGA implementation. Extrapolating this processing time to a picture of resolution 1024x768 pixels, we get a processing time of 37.4 ms per frame or, in other words, we have achieved a frame rate of 25 pictures per second.

The H.264 video encoder was first implemented in Matlab in order to estimate the quality of the reconstructed image and the compression that can be achieved. In addition, the Matlab output serves as a reference for verifying the Verilog output. Subsequently, the core modules of the encoder as described earlier were realized using Verilog for ASIC/FPGA implementation. The resulting qualities of the reconstructed images obtained with intra-prediction using Matlab and Verilog compare favorably, as can be seen from Fig. 8. It may be seen from Fig. 8(b) and 8(c) that the Verilog result is very close to the Matlab result since the Verilog code uses at least 16-bit precision.
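The PSNR figure used to compare the original and reconstructed pictures can be computed as sketched below. This is a minimal Python sketch, not the Matlab program of this work: it assumes the conventional definition PSNR = 10 log10(peak^2 / MSE) with a peak value of 255 for 8-bit samples, and the function name and the tiny test pictures are hypothetical.

```python
from math import inf, log10


def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equally sized pictures given as lists of rows.
    ASSUMPTION: conventional definition with an 8-bit peak of 255."""
    count = 0
    squared_error = 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            squared_error += (o - r) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return inf  # identical pictures
    return 10 * log10(peak ** 2 / mse)


# Hypothetical 2x2 pictures: every reconstructed sample is off by one.
orig_pic = [[100, 101], [102, 103]]
rec_pic = [[101, 100], [103, 102]]
print(round(psnr(orig_pic, rec_pic), 2))
```

An error of one grey level on every sample gives an MSE of 1 and hence roughly 48.13 dB, which puts the better than 35 dB quoted for this design into perspective.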
Fig 6: Reconstructed Picture Waveforms of First Sub Block

7. FPGA IMPLEMENTATION

The various modules described in previous sections were coded in Verilog, simulated using ModelSim, synthesized using Synplify Pro 8.5 and placed and routed using Xilinx Project Navigator ISE 8.2. The target device chosen was the Xilinx Virtex-II Pro XUPVP30 -7 FF896 FPGA since the board available in our laboratory is based on this FPGA. The core parts of the video encoder design described in previous sections utilize 863,469 gates and 12 block RAMs with 1666 occupied slices. The maximum frequency of operation is 124 MHz for “read_clk”. This works out to a frame rate of 25 per second for a picture size of 1024x768 pixels as explained earlier. With a higher speed FPGA, the frame rate can be increased to 30. The Verilog code developed for this project is fully RTL compliant and technology independent. As a result, it can work on any FPGA or ASIC without needing to change any code. As an ASIC, it is likely to work for higher resolutions up to 1600x1200 pixels at 30 frames/sec.
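The frame rate quoted above follows from the simulation timestamps by simple arithmetic, which the short Python sketch below reproduces. The variable and function names are hypothetical; the numbers are those reported in the simulation results, and, as in the text, the pixel count is taken as width times height and latency is ignored.

```python
START_NS = 98445       # first reconstructed pixel (Fig. 6)
END_NS = 1638999       # last reconstructed pixel (Fig. 7)
SIM_PERIOD_NS = 2      # "read_clk" period used in simulation
F_CLK_HZ = 124e6       # maximum "read_clk" frequency after place and route

# Clock cycles needed for the 512x256 test picture.
cycles = (END_NS - START_NS) // SIM_PERIOD_NS
cycles_per_pixel = round(cycles / (512 * 256), 1)


def frame_time_ms(width, height, cpp=cycles_per_pixel, f_clk=F_CLK_HZ):
    """Processing time of one frame in milliseconds, ignoring latency."""
    return width * height * cpp / f_clk * 1e3


t_small = frame_time_ms(512, 256)
t_large = frame_time_ms(1024, 768)
fps_large = 1e3 / t_large

print(cycles, cycles_per_pixel)
print(round(t_small, 2), round(t_large, 1), round(fps_large, 1))
```

The extrapolated frame time of about 37.4 ms corresponds to slightly more than 26 frames per second, so the quoted 25 frames per second leaves a small margin.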
8. CONCLUSION
Compression”, ICSEA-2004 Conference Proceedings, Islamabad, pp. 189-193, December 2004.
[5] Liu Ling-zhi, Qiu Lin, Rong Meng-tian, Jiang Li, “A 2-D Forward/inverse Integer Transform Processor of H.264 based on Highly Parallel Architecture,” Proceedings of the 4th IEEE International Workshop on System-on-Chip for Real-Time Applications (IWSOC’04), 2004.
[6] Qiang Peng and Jin Jing, “H.264 System on Chip Design and Verification”, The IEEE 2003 Workshop on Signal Processing Systems (SIPS’03), 2003.
[7] I. E. G Richardson, “H.264 and MPEG-4 Video Compression (Video Coding for Next Generation Multimedia)”, John Wiley, January 2004.
[8] “MPEG-4 Overview”, ISO/IEC JTC 1/SC29/WG11 N4668.
[9] D. LeGall, “MPEG: A video Compression Standard for Multimedia Application,” Communication, ACM, 34, pp. 46-58, Apr. 1991.
[10] F. Pan, “Fast Intra Mode Decision Algorithm for H.264/AVC Video Coding,” Proceedings, International Conference, Image Processing (ICIP), 2, pp. 781-784, Oct. 2004.
[11] W. K. Cham, “Development of Integer Cosine Transforms by the Principle of Dyadic Symmetry,” IEE Proceedings, 136, pt. I, No. 4, pp. 276-282, August, 1989.
[12] K. M. Cheung, F. Pollara, and M. Shahshahani, “Integer Cosine Transform for Image Compression,” The Telecommunications and Data Acquisition Progress Report 42-105, Vol. January-March 1991, Jet Propulsion Laboratory, Pasadena, California, pp. 45-60, May 15, 1991.
[13] Thomas Wiegand and Gary J. Sullivan, “Overview of the H.264/AVC Video Coding Standard,” IEEE Transactions On Circuits and Systems For Video Technology, pp. 1-17, July 2003.
[14] J. Ribas-Corbera, P. A. Chou, and S. Regunathan: “A Generalized Hypothetical Reference Decoder for H.264/AVC,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 18-32, July 2003.
[15] Keshaveni. N, Ramachandran. S, K.S. Gurumurthy: “Design and Implementation of Integer Transform and Quantization Processor for H.264 Encoder on FPGA”, International Conference on Advances in Computing, Control and Telecommunication Technologies, December 2009.
[16] Keshaveni. N, Ramachandran. S, K.S. Gurumurthy: “Implementation of Context Adaptive Variable Length Coder for H.264 Video Encoder”, International Journal of Recent Trends in Engineering [ISSN: 1797-9617] by the Academy Publishers, Finland.