Video Codec Design
Iain E. G. Richardson
Copyright © 2002 John Wiley & Sons, Ltd
ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic)
To
Freya and Hugh
Iain E. G. Richardson
The Robert Gordon University, Aberdeen, UK
Contents
1 Introduction
  1.1 Image and Video Compression
  1.2 Video CODEC Design
  1.3 Structure of this Book

2 Digital Video
  2.1 Introduction
  2.2 Concepts, Capture and Display
    2.2.1 The Video Image
    2.2.2 Digital Video
    2.2.3 Video Capture
    2.2.4 Sampling
    2.2.5 Display
  2.3 Colour Spaces
    2.3.1 RGB
    2.3.2 YCrCb
  2.4 The Human Visual System
  2.5 Video Quality
    2.5.1 Subjective Quality Measurement
    2.5.2 Objective Quality Measurement
  2.6 Standards for Representing Digital Video
  2.7 Applications
    2.7.1 Platforms
  2.8 Summary
  References

3 Image and Video Compression Fundamentals
  3.1 Introduction
    3.1.1 Do We Need Compression?
  3.2 Image and Video Compression
    3.2.1 DPCM (Differential Pulse Code Modulation)
    3.2.2 Transform Coding
    3.2.3 Motion-compensated Prediction
    3.2.4 Model-based Coding
  3.3 Image CODEC
    3.3.1 Transform Coding
    3.3.2 Quantisation
4 Video Coding Standards: JPEG and MPEG
  4.1 Introduction
  4.2 The International Standards Bodies
    4.2.1 The Expert Groups
    4.2.2 The Standardisation Process
    4.2.3 Understanding and Using the Standards
  4.3 JPEG (Joint Photographic Experts Group)
    4.3.1 JPEG
    4.3.2 Motion JPEG
    4.3.3 JPEG-2000
  4.4 MPEG (Moving Picture Experts Group)
    4.4.1 MPEG-1
    4.4.2 MPEG-2
    4.4.3 MPEG-4
  4.5 Summary
  References

5 Video Coding Standards: H.261, H.263 and H.26L
  5.1 Introduction
  5.2 H.261
  5.3 H.263
    5.3.1 Features
  5.4 The H.263 Optional Modes/H.263+
    5.4.1 H.263 Profiles
  5.5 H.26L
  5.6 Performance of the Video Coding Standards
  5.7 Summary
  References

6 Motion Estimation and Compensation
  6.1 Introduction
  6.2 Motion Estimation and Compensation
    6.2.1 Requirements for Motion Estimation and Compensation
    6.2.2 Block Matching
    6.2.3 Minimising Difference Energy
  6.3 Full Search Motion Estimation
  6.4 Fast Search
    6.4.1 Three-Step Search (TSS)
7 Transform Coding
  7.1 Introduction
  7.2 Discrete Cosine Transform
  7.3 Discrete Wavelet Transform
  7.4 Fast Algorithms for the DCT
    7.4.1 Separable Transforms
    7.4.2 Flowgraph Algorithms
    7.4.3 Distributed Algorithms
    7.4.4 Other DCT Algorithms
  7.5 Implementing the DCT
    7.5.1 Software DCT
    7.5.2 Hardware DCT
  7.6 Quantisation
    7.6.1 Types of Quantiser
    7.6.2 Quantiser Design
    7.6.3 Quantiser Implementation
    7.6.4 Vector Quantisation
  7.7 Summary
  References

8 Entropy Coding
  8.1 Introduction
  8.2 Data Symbols
    8.2.1 Run-Level Coding
10 Rate, Distortion and Complexity
  10.1 Introduction
  10.2 Bit Rate and Distortion
    10.2.1 The Importance of Rate Control
    10.2.2 Rate-Distortion Performance
    10.2.3 The Rate-Distortion Problem
    10.2.4 Practical Rate Control Methods
  10.3 Computational Complexity
    10.3.1 Computational Complexity and Video Quality
    10.3.2 Variable Complexity Algorithms
    10.3.3 Complexity-Rate Control
  10.4 Summary
  References

11 Transmission of Coded Video
  11.1 Introduction
  11.2 Quality of Service Requirements and Constraints
    11.2.1 QoS Requirements for Coded Video
    11.2.2 Practical QoS Performance
    11.2.3 Effect of QoS Constraints on Coded Video
12 Platforms
  12.1 Introduction
  12.2 General-purpose Processors
    12.2.1 Capabilities
    12.2.2 Multimedia Support
  12.3 Digital Signal Processors
  12.4 Embedded Processors
  12.5 Media Processors
  12.6 Video Signal Processors
  12.7 Custom Hardware
  12.8 Co-processors
  12.9 Summary
  References

13 Video CODEC Design
  13.1 Introduction
  13.2 Video CODEC Interface
    13.2.1 Video In/Out
    13.2.2 Coded Data In/Out
    13.2.3 Control Parameters
    13.2.4 Status Parameters
  13.3 Design of a Software CODEC
    13.3.1 Design Goals
    13.3.2 Specification and Partitioning
    13.3.3 Designing the Functional Blocks
    13.3.4 Improving Performance
    13.3.5 Testing
  13.4 Design of a Hardware CODEC
    13.4.1 Design Goals
    13.4.2 Specification and Partitioning
    13.4.3 Designing the Functional Blocks
    13.4.4 Testing
  13.5 Summary
  References

14 Future Developments
  14.1 Introduction
  14.2 Standards Evolution
  14.3 Video Coding Research
  14.4 Platform Trends
  14.5 Application Trends
  14.6 Video CODEC Design
  References
Bibliography

Glossary

Index
Index
Application Programming Interface, 276
artefacts
blocking, 199
ringing, 200
block matching, 43, 95
blockiness. See artefacts:blocking
B-picture, 59
chrominance, 12
CODEC
entropy, 37, 45, 163
image, 33
video, 41
coded block pattern, 70, 167
coding
arithmetic, 188
channel, 29
content-based, 70
entropy, 163
field, 64
Huffman, 169
lossless, 28
lossy, 28
mesh, 74
model based, 32
object-based, 70
run-level, 37, 164
scalable, 65, 73
shape, 70
source, 29
sprite, 73
transform, 31, 127
colour space, 10
RGB, 11
YCrCb, 12
complexity
complexity-rate control, 231
computational, 226
variable complexity algorithms, 228
compression
image and video, 28
DCT, 31, 127
H.263 (Continued)
PB-frames, 82
profiles, 86
H.26L, 87, 289
H.323, 252
high definition television, 67
Human Visual System, 16
HVS. See Human Visual System
interface
coded data, 274
control and status, 276, 277
to video CODEC, 271
video, 271
interframe, 41
International Standards Organisation, 47
International Telecommunications Union, 47
intraframe, 41
I-picture, 59
ISO. See International Standards Organisation
ITU. See International Telecommunications
Union
JPEG, 51
baseline CODEC, 51
hierarchical, 55
lossless, 54
Motion, 56
JPEG2000, 56
KLT, 31
latency, 240, 237, 243
luminance, 12
memory bandwidth, 274
MJPEG. See JPEG:motion
motion
compensation, 43, 94
estimation, 43, 94, 109
Cross Search, 104
full search, 99
hardware, 122
hierarchical, 107
Logarithmic Search, 103
nearest neighbours search, 105
OTA, 105
performance, 109
software, 117
sub-pixel, 111
Three Step Search, 102
vectors, 43, 94, 167
MPEG, 47
MPEG-1, 58
syntax, 61
MPEG-2, 64
Program Stream, 250
systems, 249
Transport Stream, 250
video, 64
MPEG-21, 49, 289
MPEG-4, 67
Binary Alpha Blocks, 71
profiles and levels, 74
Short Header, 68
Very Low Bitrate Video core, 68
Video Object, 68
Video Object Plane, 68
video packet, 73
MPEG-7, 49, 289
Multipoint Control Unit, 253
OBMC. See motion compensation
prediction
backwards, 113
bidirectional, 113
forward, 113
processors
co-processor, 267
DSP, 260
embedded, 262
general purpose, 257
media, 263
PC, 257
video signal, 264
profiles and levels, 66, 74
quality, 16
DSCQS, 17
ITU-R 500-10, 17
objective, 19
PSNR, 19
recency, 18
subjective, 17
Quality of Service, 235
quantisation, 35, 150
scale factor, 35
vector, 157
rate, 212
control, 212, 220
Lagrangian optimization, 218
rate-distortion, 217
Real Time Protocol, 252, 254
redundancy
statistical, 29
subjective, 30
reference picture selection, 84, 247
re-ordering
pictures, 60
modified scan, 166
zigzag, 166, 37
RGB. See colour space
ringing. See artefacts: ringing
RVLC. See variable length codes
sampling
4-2-0, 272, 12
4-2-2, 272, 12
4-4-4, 12
spatial, 7
temporal, 7
scalability. See coding: scalable
Single Instruction Multiple Data, 258
slice, 63, 83
source model, 28
still image, 5
sub pixel motion estimation. See motion:
estimation
test model, 50
transform
DCT. See DCT
fractal, 35
integer, 145
wavelet, 35, 57, 133
Introduction
1.1 IMAGE AND VIDEO COMPRESSION
The subject of this book is the compression (coding) of digital images and video. Within the last 5-10 years, image and video coding have gone from being relatively esoteric research subjects with few real applications to become key technologies for a wide range of mass-market applications, from personal computers to television.
Like many other recent technological developments, the emergence of video and image coding in the mass market is due to convergence of a number of areas. Cheap and powerful processors, fast network access, the ubiquitous Internet and a large-scale research and standardisation effort have all contributed to the development of image and video coding technologies. Coding has enabled a host of new multimedia applications including digital television, digital versatile disk (DVD) movies, streaming Internet video, home digital photography and video conferencing.
Compression coding bridges a crucial gap in each of these applications: the gap between the user's demands (high-quality still and moving images, delivered quickly at a reasonable cost) and the limited capabilities of transmission networks and storage devices. For example, a television-quality digital video signal requires 216 Mbits of storage or transmission capacity for one second of video. Transmission of this type of signal in real time is beyond the capabilities of most present-day communications networks. A 2-hour movie (uncompressed) requires over 194 Gbytes of storage, equivalent to 42 DVDs or 304 CD-ROMs. In order for digital video to become a plausible alternative to its analogue predecessors (analogue television or VHS videotape), it has been necessary to develop methods of reducing or compressing this prohibitively high bit-rate signal.
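The storage figures quoted above follow directly from the bit rate. A minimal sketch of the arithmetic in Python, assuming the 216 Mbit/s rate given in the text, a 4.7 Gbyte DVD and a 640 Mbyte CD-ROM (the disc capacities are assumptions for illustration):

    bit_rate = 216e6              # bits per second for a television-quality signal
    duration = 2 * 60 * 60        # a 2-hour movie, in seconds

    total_bytes = bit_rate * duration / 8
    print(total_bytes / 1e9)      # approximately 194 Gbytes

    print(total_bytes / 4.7e9)    # roughly 41-42 DVDs (4.7 Gbyte each)
    print(total_bytes / 640e6)    # roughly 304 CD-ROMs (640 Mbyte each)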
The drive to solve this problem has taken several decades and massive efforts in research, development and standardisation (and work continues to improve existing methods and develop new coding paradigms). However, efficient compression methods are now a firmly established component of the new digital media technologies such as digital television and DVD-video. A welcome side effect of these developments is that video and image compression has enabled many novel visual communication applications that would not have previously been possible. Some areas have taken off more quickly than others (for example, the long-predicted boom in video conferencing has yet to appear), but there is no doubt that visual compression is here to stay. Every new PC has a number of designed-in features specifically to support and accelerate video compression algorithms. Most developed nations have a timetable for stopping the transmission of analogue television, after which all television receivers will need compression technology to decode and display TV images. VHS videotapes are finally being replaced by DVDs which can be played back on DVD players or on PCs. The heart of all of these applications is the video compressor and decompressor; or enCOder/DECoder; or video CODEC.
[Figure: The structure of this book — Digital Video (Chapter 2); Image and Video Compression (Chapter 3); Motion Estimation/Compensation, Transform Coding and Entropy Coding (Chapters 6-8); Rate, Distortion, Complexity and Transmission (Chapters 10-11); Section 3, System Design: Platforms, Design and Future Trends (Chapters 12-14).]
Chapter 5, H.261, H.263 and H.26L, explains the concepts of the ITU-T video coding standards H.261 and H.263 and the emerging H.26L. The chapter ends with a comparison of the performance of the main image and video coding standards.
Chapter 6, Motion Estimation and Compensation, deals with the front end of a video CODEC. The requirements and goals of motion-compensated prediction are explained and the chapter discusses a number of practical approaches to motion estimation in software or hardware designs.
Chapter 7, Transform Coding, concentrates mainly on the popular discrete cosine transform. The theory behind the DCT is introduced and practical algorithms for calculating the forward and inverse DCT are described. The discrete wavelet transform (an increasingly popular alternative to the DCT) and the process of quantisation (closely linked to transform coding) are discussed.
Chapter 8, Entropy Coding, explains the statistical compression process that forms the final step in a video encoder; shows how Huffman code tables are designed and used; introduces arithmetic coding; and describes practical entropy encoder and decoder designs.
Chapter 9, Pre- and Post-processing, addresses the important issue of input and output processing; shows how pre-filtering can improve compression performance; and examines a number of post-filtering techniques, from simple de-blocking filters to computationally complex, high-performance algorithms.
Chapter 10, Rate, Distortion and Complexity, discusses the relationships between compressed bit rate, visual distortion and computational complexity in a lossy video CODEC; describes rate control algorithms for different transmission environments; and introduces the emerging techniques of variable-complexity coding that allow the designer to trade computational complexity against visual quality.
Chapter 11, Transmission of Coded Video, addresses the influence of the transmission environment on video CODEC design; discusses the quality of service required by a video CODEC and provided by typical transport scenarios; and examines ways in which quality of service can be matched between the CODEC and the network to maximise visual quality.
Chapter 12, Platforms, describes a number of alternative platforms for implementing practical video CODECs, ranging from general-purpose PC processors to custom-designed hardware platforms.
Chapter 13, Video CODEC Design, brings together a number of the themes discussed in previous chapters and discusses how they influence the design of video CODECs; examines the interfaces between a video CODEC and other system components; and presents two design studies, a software CODEC and a hardware CODEC.
Chapter 14, Future Developments, summarises some of the recent work in research and development that will influence the next generation of video CODECs.
Each chapter includes references to papers and websites that are relevant to the topic. The bibliography lists a number of books that may be useful for further reading and a companion web site to the book may be found at:
http://www.vcodex.com/videocodecdesign/
Digital Video
2.1 INTRODUCTION
Digital video is now an integral part of many aspects of business, education and entertainment, from digital TV to web-based video news. Before examining methods for compressing and transporting digital video, it is necessary to establish the concepts and terminology relating to video in the digital domain. Digital video is visual information represented in a discrete form, suitable for digital electronic storage and/or transmission. In this chapter we describe and define the concept of digital video: essentially a sampled two-dimensional (2-D) version of a continuous three-dimensional (3-D) scene. Dealing with colour video requires us to choose a colour space (a system for representing colour) and we discuss two widely used colour spaces, RGB and YCrCb. The goal of a video coding system is to support video communications with an acceptable visual quality: this depends on the viewer's perception of visual information, which in turn is governed by the behaviour of the human visual system. Measuring and quantifying visual quality is a difficult problem and we describe some alternative approaches, from time-consuming subjective tests to automatic objective tests (with varying degrees of accuracy).
2.2.2 Digital Video

A real visual scene is continuous both spatially and temporally. In order to represent and process a visual scene digitally it is necessary to sample the real scene spatially (typically on a rectangular grid in the video image plane) and temporally (typically as a series of still images or frames sampled at regular intervals in time) as shown in Figure 2.2. Digital video is the representation of a spatio-temporally sampled video scene in digital form. Each spatio-temporal sample (described as a picture element or pixel) is represented digitally as one or more numbers that describe the brightness (luminance) and colour of the sample.

[Figure 2.2: Spatial and temporal sampling of a moving scene.]
A digital video system is shown in Figure 2.3. At the input to the system, a real visual scene is captured, typically with a camera, and converted to a sampled digital representation. This digital video signal may then be handled in the digital domain in a number of ways, including processing, storage and transmission. At the output of the system, the digital video signal is displayed to a viewer by reproducing the 2-D video image (or video sequence) on a 2-D display.

[Figure 2.3: A digital video system: a scene is captured by a camera, handled in the digital domain (processing, storage, transmission) and displayed to a viewer.]
2.2.3 Video Capture

Video is captured using a camera or a system of cameras. Most current digital video systems use 2-D video, captured with a single camera. The camera focuses a 2-D projection of the video scene onto a sensor, such as an array of charge coupled devices (CCD array). In the case of colour image capture, each colour component (see Section 2.3) is filtered and projected onto a separate CCD array.
Figure 2.4 shows a two-camera system that captures two 2-D projections of the scene, taken from different viewing angles. This provides a stereoscopic representation of the scene: the two images, when viewed in the left and right eye of the viewer, give an appearance of depth to the scene. There is an increasing interest in the use of 3-D digital video, where the video signal is represented and processed in three dimensions. This requires the capture system to provide depth information as well as brightness and colour, and this may be obtained in a number of ways. Stereoscopic images can be processed to extract approximate depth information and form a 3-D representation of the scene: other methods of obtaining depth information include processing of multiple images from a single camera (where either the camera or the objects in the scene are moving) and the use of laser striping to obtain depth maps. In this book we will concentrate on 2-D video systems.
Generating a digital representation of a video scene can be considered in two stages: acquisition (converting a projection of the scene into an electrical signal, for example via a CCD array) and digitisation (sampling the projection spatially and temporally and converting each sample to a number or set of numbers). Digitisation may be carried out using a separate device or board (e.g. a video capture card in a PC): increasingly, the digitisation process is becoming integrated with cameras so that the output of a camera is a signal in sampled digital form.
2.2.4 Sampling

A digital image may be generated by sampling an analogue video signal (i.e. a varying electrical signal that represents a video image) at regular intervals. The result is a sampled version of the image, defined only at a grid of discrete sampling points.
[Figure 2.5: Typical video image resolutions: VHS video 352 x 288 (101 376 pixels), broadcast television 704 x 576 (405 504 pixels), high-definition television 1440 x 1152.]

Video frame rate                Appearance
Below 10 frames per second      Jerky, unnatural appearance to movement
10-20 frames per second         Slow movements appear OK; rapid movement is clearly jerky
20-30 frames per second         Movement is reasonably smooth
50-60 frames per second         Movement is very smooth

[Figure 2.6: A complete frame is split into two interlaced fields, the upper field and the lower field.]
A frame rate below 10 frames per second is sometimes used for very low bit-rate video communications (because the amount of data is relatively small): however, motion is clearly jerky and unnatural at this rate. Between 10 and 20 frames per second is more typical for low bit-rate video communications; 25 or 30 frames per second is standard for television pictures (together with the use of interlacing, see below); 50 or 60 frames per second is appropriate for high-quality video (at the expense of a very high data rate).
The visual appearance of a temporally sampled video sequence can be improved by using interlaced video, commonly used for broadcast-quality television signals. For example, the European PAL video standard operates at a temporal frame rate of 25 Hz (i.e. 25 complete frames of video per second). However, in order to improve the visual appearance without increasing the data rate, the video sequence is composed of fields at a rate of 50 Hz (50 fields per second). Each field contains half of the lines that make up a complete frame (Figure 2.6): the odd- and even-numbered lines from the frame on the left are placed in two separate fields, each containing half the information of a complete frame. These fields are captured and displayed at 1/50th of a second intervals and the result is an update rate of 50 Hz, with the data rate of a signal at 25 Hz. Video that is captured and displayed in this way is known as interlaced video and generally has a more pleasing visual appearance than video transmitted as complete frames (non-interlaced or progressive video). Interlaced video can, however, produce unpleasant visual artefacts when displaying certain textures or types of motion.
2.2.5 Display

Displaying a 2-D video signal involves recreating each frame of video on a 2-D display device. The most common type of display is the cathode ray tube (CRT), in which the image is formed by scanning a beam of electrons across a phosphor-coated screen (Figure 2.7).

[Figure 2.7: CRT display.]
2.3 COLOUR SPACES

2.3.1 RGB

[Figure 2.8: (a) Colour image; (b) R, (c) G and (d) B components.]
2.3.2 YCrCb

RGB is not necessarily the most efficient representation of colour. The human visual system (HVS, see Section 2.4) is less sensitive to colour than to luminance (brightness): however, the RGB colour space does not provide an easy way to take advantage of this since the three colours are equally important and the luminance is present in all three colour components. It is possible to represent a colour image more efficiently by separating the luminance from the colour information.
A popular colour space of this type is Y:Cr:Cb. Y is the luminance component, i.e. a monochrome version of the colour image. Y is a weighted average of R, G and B:

Y = kr R + kg G + kb B

where k are weighting factors. The colour information can be represented as colour difference or chrominance components, where each chrominance component is the difference between R, G or B and the luminance Y:
Cr = R - Y
Cb = B - Y
Cg = G - Y
The complete description is given by Y (the luminance component) and three colour differences Cr, Cb and Cg that represent the variation between the colour intensity and the background luminance of the image.
So far, this representation has little obvious merit: we now have four components rather than three. However, it turns out that the value of Cr + Cb + Cg is a constant. This means that only two of the three chrominance components need to be transmitted: the third component can always be found from the other two. In the Y:Cr:Cb space, only the luminance (Y) and red and blue chrominance (Cr, Cb) are transmitted. Figure 2.9 shows the effect of this operation on the colour image. The two chrominance components only have significant values where there is a significant presence or absence of the appropriate colour (for example, the pink hat appears as an area of relative brightness in the red chrominance).
The equations for converting an RGB image into the Y:Cr:Cb colour space and vice versa are given in Equations 2.1 and 2.2. Note that G can be extracted from the Y:Cr:Cb representation by subtracting Cr and Cb from Y.
Y = 0.299 R + 0.587 G + 0.114 B
Cb = 0.564 (B - Y)                                        (2.1)
Cr = 0.713 (R - Y)

R = Y + 1.402 Cr
G = Y - 0.344 Cb - 0.714 Cr                               (2.2)
B = Y + 1.772 Cb
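A minimal sketch of these conversions in Python (floating-point, without the rounding, offsets or clipping used in practical integer implementations):

    def rgb_to_ycrcb(r, g, b):
        # Equation 2.1
        y = 0.299 * r + 0.587 * g + 0.114 * b
        cb = 0.564 * (b - y)
        cr = 0.713 * (r - y)
        return y, cr, cb

    def ycrcb_to_rgb(y, cr, cb):
        # Equation 2.2
        r = y + 1.402 * cr
        g = y - 0.344 * cb - 0.714 * cr
        b = y + 1.772 * cb
        return r, g, b

    y, cr, cb = rgb_to_ycrcb(200.0, 120.0, 50.0)
    print(ycrcb_to_rgb(y, cr, cb))   # approximately (200, 120, 50)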
[Figure 2.9: Y, Cr and Cb components of the colour image.]

[Figure 2.10: 4:2:0 and 4:4:4 sampling patterns: positions of Y, Cr and Cb samples.]
Example
Image resolution: 720 x 576 pixels
Y resolution: 720 x 576 samples, each represented with 8 bits
4:4:4 Cr, Cb resolution: 720 x 576 samples, each 8 bits
Total number of bits: 720 x 576 x 8 x 3 = 9 953 280 bits
4:2:0 Cr, Cb resolution: 360 x 288 samples, each 8 bits
Total number of bits: (720 x 576 x 8) + (360 x 288 x 8 x 2) = 4 976 640 bits
[Figure 2.11: 4 pixels: 24 and 12 bits per pixel.]
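The worked example above can be expressed as a short calculation. This sketch assumes 8 bits per sample; the 4:2:2 entry (chrominance subsampled horizontally only) is added for comparison and is not part of the example above:

    def frame_bits(width, height, sampling, bits_per_sample=8):
        # Bits required to store one frame of Y, Cr and Cb samples.
        y_samples = width * height
        chroma_samples = {"4:4:4": y_samples,
                          "4:2:2": y_samples // 2,
                          "4:2:0": y_samples // 4}[sampling]
        return (y_samples + 2 * chroma_samples) * bits_per_sample

    print(frame_bits(720, 576, "4:4:4"))   # 9 953 280 bits
    print(frame_bits(720, 576, "4:2:0"))   # 4 976 640 bits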
2.4 THE HUMAN VISUAL SYSTEM

[Figure 2.12: The human visual system: the lens and iris focus the image onto the retina (with its central fovea); the optic nerve carries signals to the brain.]
Eye: The image is focused by the lens onto the photodetecting area of the eye, the retina. Focusing and object tracking are achieved by the eye muscles and the iris controls the aperture of the lens and hence the amount of light entering the eye.
Retina: The retina consists of an array of cones (photoreceptors sensitive to colour at high light levels) and rods (photoreceptors sensitive to luminance at low light levels). The more sensitive cones are concentrated in a central region (the fovea) which means that high-resolution colour vision is only achieved over a small area at the centre of the field of view.
Optic nerve: This carries electrical signals from the retina to the brain.
Brain: The human brain processes and interprets visual information, based partly on the received information (the image detected by the retina) and partly on prior learned responses (such as known object shapes).
The operation of the HVS is a large and complex area of study. Some of the important features of the HVS that have implications for digital video system design are listed in Table 2.3.
2.5 VIDEO QUALITY

Measuring visual quality using objective criteria gives accurate, repeatable results, but as yet there are no objective measurement systems that will completely reproduce the subjective experience of a human observer watching a video display.
2.5.1 Subjective Quality Measurement

Several test procedures for subjective quality evaluation are defined in ITU-R Recommendation BT.500-10. One of the most popular of these quality measures is the double stimulus continuous quality scale (DSCQS) method. An assessor is presented with a pair of images or short video sequences A and B, one after the other, and is asked to give A and B a score by marking on a continuous line with five intervals. Figure 2.13 shows an example of the rating form on which the assessor grades each sequence.
In a typical test session, the assessor is shown a series of sequence pairs and is asked to grade each pair. Within each pair of sequences, one is an unimpaired reference sequence and the other is the same sequence, modified by a system or process under test. A typical example from the evaluation of video coding systems is shown in Figure 2.14: the original sequence is compared with the same sequence, encoded and decoded using a video CODEC.
The order of the two sequences, original and impaired, is randomised during the test session so that the assessor does not know which is the original and which is the impaired sequence. This helps prevent the assessor from prejudging the impaired sequence compared with the reference sequence. At the end of the session, the scores are converted to a normalised range and the result is a score (sometimes described as a mean opinion score) that indicates the relative quality of the impaired and reference sequences.
[Figure 2.13: DSCQS rating form: each of sequences A and B is graded on a continuous scale from Excellent to Bad.]
The DSCQS test is generally accepted as a realistic measure of subjective visual quality. However, it suffers from practical problems. The results can vary significantly, depending on the assessor and also on the video sequence under test. This variation can be compensated for by repeating the test with several sequences and several assessors. An expert assessor (e.g. one who is familiar with the nature of video compression distortions or artefacts) may give a biased score and it is preferable to use non-expert assessors. In practice this means that a large pool of assessors is required because a non-expert assessor will quickly learn to recognise characteristic artefacts in the video sequences. These factors make it expensive and time-consuming to carry out the DSCQS tests thoroughly.
A second problem is that this test is only really suitable for short sequences of video. It has been shown2 that the recency effect means that the viewer's opinion is heavily biased towards the last few seconds of a video sequence: the quality of this last section will strongly influence the viewer's rating for the whole of a longer sequence. Subjective tests are also influenced by the viewing conditions: a test carried out in a comfortable, relaxed environment will earn a higher rating than the same test carried out in a less comfortable setting.
[Figure 2.14: DSCQS testing of a video CODEC: the source video sequence is encoded and decoded and both versions are presented on a display for assessment.]
2.5.2 Objective Quality Measurement

Because of the problems of subjective measurement, developers of digital video systems rely heavily on objective measures of visual quality. Objective measures have not yet replaced subjective testing: however, they are considerably easier to apply and are particularly useful during development and for comparison purposes.
Probably the most widely used objective measure is peak signal to noise ratio (PSNR), calculated using Equation 2.3. PSNR is measured on a logarithmic scale and is based on the mean squared error (MSE) between an original and an impaired image or video frame, relative to (2^n - 1)^2 (the square of the highest possible signal value in the image, where n is the number of bits per image sample).
PSNR_dB = 10 log10 ((2^n - 1)^2 / MSE)                    (2.3)
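A minimal sketch of this calculation for 8-bit images, using NumPy (the random test images here are purely illustrative):

    import numpy as np

    def psnr(original, impaired, bits_per_sample=8):
        # Equation 2.3: peak signal to noise ratio in dB.
        mse = np.mean((original.astype(float) - impaired.astype(float)) ** 2)
        peak = (2 ** bits_per_sample - 1) ** 2
        return 10 * np.log10(peak / mse)

    rng = np.random.default_rng(0)
    original = rng.integers(0, 256, size=(576, 720))
    impaired = np.clip(original + rng.normal(0, 5, original.shape), 0, 255)
    print(round(psnr(original, impaired), 1), "dB")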
PSNR can be calculated very easily and is therefore a very popular quality measure. It is widely used as a method of comparing the quality of compressed and decompressed video images. Figure 2.15 shows some examples: the first image (a) is the original and (b), (c) and (d) are compressed and decompressed versions of the original image. The progressively poorer image quality is reflected by a corresponding drop in PSNR.
The PSNR measure suffers from a number of limitations, however. PSNR requires an unimpaired original image for comparison: this may not be available in every case and it may not be easy to verify that an original image has perfect fidelity. A more important limitation is that PSNR does not correlate well with subjective video quality measures such as ITU-R 500. For a given image or image sequence, high PSNR indicates relatively high quality and low PSNR indicates relatively low quality. However, a particular value of PSNR does not necessarily equate to an absolute subjective quality. For example, Figure 2.16 shows two impaired versions of the original image from Figure 2.15. Image (a) (with a blurred background) has a PSNR of 32.7 dB, whereas image (b) (with a blurred foreground) has a higher PSNR of 37.5 dB. Most viewers would rate image (b) as significantly poorer than image (a): however, the PSNR measure simply counts the mean squared pixel errors and by this method image (b) is ranked as better than image (a). This example shows that PSNR ratings do not necessarily correlate with true subjective quality.
Because of these problems, there has been a lot of work in recent years to try to develop a more sophisticated objective test that closely approaches subjective test results. Many different approaches have been proposed,3-5 but none of these has emerged as a clear alternative to subjective tests. With improvements in objective quality measurement, however, some interesting applications become possible, such as proposals for constant-quality video coding6 (see Chapter 10, Rate Control).
ITU-R BT.500-10 (and more recently, P.910) describe standard methods for subjective quality evaluation: however, as yet there is no standardised, accurate system for objective (automatic) quality measurement that is suitable for digitally coded video. In recognition of this, the ITU-T Video Quality Experts Group (VQEG) are developing a standard for objective video quality evaluation7. The first step in this process was to test and compare potential models for objective evaluation. In March 2000, VQEG reported on the first round of tests in which 10 competing systems were tested under identical conditions.
[Figure 2.15: PSNR examples: (a) original; (b) 33.2 dB; (c) 31.8 dB; (d) 26.5 dB.]

[Figure 2.16: Impaired versions of the original image: (a) blurred background, PSNR 32.7 dB; (b) blurred foreground, PSNR 37.5 dB.]

Unfortunately, none of the 10 proposals was considered suitable for standardisation. The problem of accurate objective quality measurement is therefore likely to remain for some time to come.
The PSNR measure is widely used as an approximate objective measure for visual quality and so we will use this measure for quality comparison in this book. However, it is worth remembering the limitations of PSNR when comparing different systems and techniques.

2.6 STANDARDS FOR REPRESENTING DIGITAL VIDEO

Table 2.4 ITU-R BT.601-5 parameters

                                    30 Hz frame rate    25 Hz frame rate
Fields per second                   60                  50
Lines per complete frame            525                 625
Luminance samples per line          858                 864
Chrominance samples per line        429                 432
Bits per sample                     8                   8
Total bit rate                      216 Mbps            216 Mbps
Active lines per frame              480                 576
Active samples per line (Y)         720                 720
Active samples per line (Cr, Cb)    360                 360
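The 216 Mbps figure in Table 2.4 follows directly from the other parameters; a quick check of the arithmetic (a sketch, not part of the standard itself):

    # 30 Hz (525-line) system: one luminance and two chrominance components per line
    samples_per_line = 858 + 429 + 429
    print(samples_per_line * 525 * 30 * 8)    # 216 216 000 bits/s

    # 25 Hz (625-line) system
    samples_per_line = 864 + 432 + 432
    print(samples_per_line * 625 * 25 * 8)    # 216 000 000 bits/s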
A family of intermediate formats is based on the Common Intermediate Format (CIF), which has a resolution of 352 x 288 pixels. The resolutions of these formats are listed in Table 2.5 and their relative dimensions are illustrated in Figure 2.17.

Table 2.5 Intermediate formats

Format               Luminance resolution
Sub-QCIF             128 x 96
Quarter CIF (QCIF)   176 x 144
CIF                  352 x 288
4CIF                 704 x 576
2.7 APPLICATIONS
The last decade has seen a rapid increase in applications for digital video technology and
new, innovative applications continue to emerge. A small selection is listed here:
Education and distance learning: video is seen as an important component of this in the form of stored video material and video conferencing.
Remote medicine: Medical support provided at a distance, or telemedicine, is another potential growth area where digital video and images may be used together with other monitoring techniques to provide medical advice at a distance.
Television: Digital television is now widely available and many countries have a timetable for switching off the existing analogue television service. Digital TV is one of the most important mass-market applications for video coding and compression.
Video production: Fully digital video storage, editing and production have been widely used in television studios for many years. The requirement for high image fidelity often means that the popular lossy compression methods described in this book are not an option.
Games and entertainment: The potential for real video imagery in the computer gaming market is just beginning to be realised with the convergence of 3-D graphics and natural video.
2.7.1 Platforms
Developers are targeting an increasing range of platforms to run the ever-expanding list of digital video applications.
Dedicated platforms are designed to support a specific video application and no other. Examples include digital video cameras, dedicated video conferencing systems, digital TV set-top boxes and DVD players. In the early days, the high processing demands of digital video meant that dedicated platforms were the only practical design solution. Dedicated platforms will continue to be important for low-cost, mass-market systems but are increasingly being replaced by more flexible solutions.
The PC has emerged as a key platform for digital video. A continual increase in PC processing capabilities (aided by hardware enhancements for media applications such as the Intel MMX instructions) means that it is now possible to support a wide range of video applications from video editing to real-time video conferencing.
Embedded platforms are an important new market for digital video techniques. For example, the personal communications market is now huge, driven mainly by users of mobile telephones. Video services for mobile devices (running on low-cost embedded processors) are seen as a major potential growth area. This type of platform poses many challenges for application developers due to the limited processing power, relatively poor wireless communications channel and the requirement to keep equipment and usage costs to a minimum.
2.8 SUMMARY

Sampling of an analogue video signal, both spatially and temporally, produces a digital video signal. Representing a colour scene requires at least three separate components: popular colour spaces include red/green/blue and Y/Cr/Cb (which has the advantage that the chrominance may be subsampled to reduce the information rate without significant loss of quality). The human observer's response to visual information affects the way we perceive video quality and this is notoriously difficult to quantify accurately. Subjective tests (involving real observers) are time-consuming and expensive to run; objective tests range from the simplistic (but widely used) PSNR measure to complex models of the human visual system.
The digital video applications listed above have been made possible by the development of compression or coding technology. In the next chapter we introduce the basic concepts of video and image compression.
REFERENCES
1. Recommendation ITU-R BT.500-10, Methodology for the subjective assessment of the quality of television pictures, ITU-R, 2000.
2. R. Aldridge, J. Davidoff, M. Ghanbari, D. Hands and D. Pearson, Subjective assessment of time-varying coding distortions, Proc. PCS96, Melbourne, March 1996.
3. C. J. van den Branden Lambrecht and O. Verscheure, Perceptual quality measure using a spatio-temporal model of the Human Visual System, Digital Video Compression Algorithms and Technologies, Proc. SPIE, Vol. 2668, San Jose, 1996.
4. H. Wu, Z. Yu, S. Winkler and T. Chen, Impairment metrics for MC/DPCM/DCT encoded digital video, Proc. PCS01, Seoul, April 2001.
5. K. T. Tan and M. Ghanbari, A multi-metric objective picture quality measurement model for MPEG video, IEEE Trans. CSVT, 10(7), October 2000.
6. A. Basso, I. Dalgic, F. Tobagi and C. J. van den Branden Lambrecht, A feedback control scheme for low latency constant quality MPEG-2 video encoding, Digital Compression Technologies and Systems for Video Communications, Proc. SPIE, Vol. 2952, Berlin, 1996.
7. http://www.vqeg.org/ [Video Quality Experts Group].
8. Recommendation ITU-R BT.601-5, Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios, ITU-R, 1995.
Image and Video Compression Fundamentals
Format      Luminance resolution   Chrominance resolution   Frames per second   Bits per second (uncompressed)   Typical compressed bit rate
ITU-R 601   858 x 525              429 x 525                30                  216 Mbps
CIF         352 x 288              176 x 144                30                  36.5 Mbps                        1-2 Mbps
QCIF        176 x 144              88 x 72                  15                  4.6 Mbps                         128 kbps (33 kbps upstream)
Even the lowest of these uncompressed bit rates is far beyond the capacity of a typical modem or ISDN connection. At high bandwidths, compression can support a much higher visual quality. For example, a 4.7 Gbyte DVD can store approximately 2 hours of uncompressed QCIF video (at 15 frames per second) or 2 hours of compressed ITU-R 601 video (at 30 frames per second). Most users would prefer to see television-quality video with smooth motion rather than postage-stamp video with jerky motion.
Video compression and video CODECs will therefore remain a vital part of the emerging multimedia industry for the foreseeable future, allowing designers to make the most efficient use of available transmission or storage capacity. In this chapter we introduce the basic components of an image or video compression system. We begin by defining the concept of an image or video encoder (compressor) and decoder (decompressor). We then describe the main functional blocks of an image encoder/decoder (CODEC) and a video CODEC.
3.2 IMAGE AND VIDEO COMPRESSION
Information-carrying signals may be compressed, i.e. converted to a representation or form that requires fewer bits than the original (uncompressed) signal. A device or program that compresses a signal is an encoder and a device or program that decompresses a signal is a decoder. An enCOder/DECoder pair is a CODEC.
Figure 3.1 shows a typical example of a CODEC as part of a communication system. The original (uncompressed) information is encoded (compressed): this is source coding. The source coded signal is then encoded further to add error protection (channel coding) prior to transmission over a channel. At the receiver, a channel decoder detects and/or corrects transmission errors and a source decoder decompresses the signal. The decompressed signal may be identical to the original signal (lossless compression) or it may be distorted or degraded in some way (lossy compression).
General-purpose compression CODECs are available that are designed to encode and compress data containing statistical redundancy. An information-carrying signal usually contains redundancy, which means that it may (in theory) be represented in a more compact way. For example, characters within a text file occur with varying frequencies: in English, the letters E, T and A occur more often than the letters Q, Z and X. This makes it possible to compress a text file by representing frequently occurring characters with short codes and infrequently occurring characters with longer codes (this principle is used in Huffman coding, described in Chapter 8). Compression is achieved by reducing the statistical redundancy in the text file. This type of general-purpose CODEC is known as an entropy CODEC.
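The principle can be illustrated with a short sketch (Huffman coding itself is covered in Chapter 8); the example string and the implementation details here are illustrative assumptions, not the method of any particular standard:

    import heapq
    from collections import Counter

    def huffman_code(text):
        # Merge the two least frequent subtrees repeatedly; frequent symbols
        # end up with short codewords, infrequent symbols with long ones.
        heap = [[count, i, {sym: ""}] for i, (sym, count) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            codes = {s: "0" + c for s, c in lo[2].items()}
            codes.update({s: "1" + c for s, c in hi[2].items()})
            heapq.heappush(heap, [lo[0] + hi[0], next_id, codes])
            next_id += 1
        return heap[0][2]

    codes = huffman_code("entropy coding exploits statistical redundancy")
    for sym in sorted(codes, key=lambda s: len(codes[s])):
        print(repr(sym), codes[sym])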
Photographic images and sequences of video frames are not amenable to compression using general-purpose CODECs. Their contents (pixel values) tend to be highly correlated, i.e. neighbouring pixels have similar values, whereas an entropy encoder performs best with data values that have a certain degree of independence (decorrelated data). Figure 3.2 illustrates the poor performance of a general-purpose entropy encoder with image data.
[Figure 3.2: (a) original image; (b) ZIP encoded and decoded; (c) JPEG encoded and decoded.]

[Figure 3.3: Image or video CODEC: a source model followed by an entropy encoder; the encoded data is stored or transmitted and the decoder reverses the process.]

The original image (a) is compressed and decompressed using a ZIP program to produce image (b). This is identical to the original (lossless compression), but the compressed file is only 92% of the size of the original, i.e. there is very little compression. Image (c) is obtained by compressing and decompressing the original using the JPEG compression method. The compressed version is less than a quarter of the size of the original (over 4 x compression) and the decompressed image looks almost identical to the original. (It is in fact slightly degraded due to the lossy compression process.)
In this example, the JPEG method achieved good compression performance by applying a source model to the image before compression. The source model attempts to exploit the properties of video or image data and to represent it in a form that can readily be compressed by an entropy encoder. Figure 3.3 shows the basic design of an image or video CODEC consisting of a source model and an entropy encoder/decoder.
Images and video signals have a number of properties that may be exploited by source models. Neighbouring samples (pixels) within an image or a video frame tend to be highly correlated and so there is significant spatial redundancy. Neighbouring regions within successive video frames also tend to be highly correlated (temporal redundancy). As well as these statistical properties (statistical redundancy), a source model may take advantage of subjective redundancy, exploiting the sensitivity of the human visual system to various characteristics of images and video. For example, the HVS is much more sensitive to low frequencies than to high ones and so it is possible to compress an image by eliminating certain high-frequency components. Image (c) in Figure 3.2 was compressed by discarding certain subjectively redundant components of the information: the decoded image is not identical to the original but the information loss is not obvious to the human viewer.
Examples of image and video source models include the following:
3.2.1 DPCM (Differential Pulse Code Modulation)

[Figure 3.4: DPCM: a pixel is predicted from previously transmitted pixels in the current and previous lines.]
The encoder predicts each pixel from one or more previously transmitted neighbouring pixels and encodes only the difference between the prediction and the actual value. If the difference is quantised before transmission, it becomes impossible to exactly reproduce the original values at the decoder. DPCM may be applied spatially (using adjacent pixels in the same frame) and/or temporally (using adjacent pixels in a previous frame to form the prediction) and gives modest compression with low complexity.
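A minimal sketch of spatial DPCM along one line of pixels, with a uniform quantiser applied to the prediction difference (the step size and the fixed prediction for the first pixel are illustrative assumptions):

    def dpcm_encode_line(pixels, step=4):
        # Predict each pixel from the previous reconstructed pixel and
        # transmit the quantised difference; quantisation makes this lossy.
        prediction = 128
        residuals, reconstructed = [], []
        for p in pixels:
            q = round((p - prediction) / step)      # quantised difference
            recon = prediction + q * step           # what the decoder will see
            residuals.append(q)
            reconstructed.append(recon)
            prediction = recon
        return residuals, reconstructed

    print(dpcm_encode_line([100, 102, 101, 105, 110, 111]))
    # The residuals are small (cheap to entropy code); the reconstruction is
    # close to, but not identical to, the original pixel values.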
3.2.2 Transform Coding

The image samples are transformed into another domain (or representation) and are represented by transform coefficients. In the spatial domain (i.e. the original form of the image), samples are highly spatially correlated. The aim of transform coding is to reduce this correlation, ideally leaving a small number of visually significant transform coefficients (important to the appearance of the original image) and a large number of insignificant coefficients (that may be discarded without significantly affecting the visual quality of the image). The transform process itself does not achieve compression: a lossy quantisation process, in which the insignificant coefficients are removed, leaving behind a small number of significant coefficients, usually follows it. Transform coding (Figure 3.5) forms the basis of most of the popular image and video compression systems and is described in more detail in this chapter and in Chapter 7.
3.2.3 Motion-compensated Prediction

Using a similar principle to DPCM, the encoder forms a model of the current frame based on the samples of a previously transmitted frame. The encoder attempts to compensate for motion in a video sequence by translating (moving) or warping the samples of the previously transmitted reference frame. The resulting motion-compensated predicted frame (the model of the current frame) is subtracted from the current frame to produce a residual error frame (Figure 3.6). Further coding usually follows motion-compensated prediction, e.g. transform coding of the residual frame.
[Figure 3.6: Motion-compensated prediction: the motion-compensated reference frame is subtracted from the current frame to give a residual frame.]
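A minimal sketch of the idea for a single block, using full-search block matching with a sum of absolute differences (SAD) criterion (block matching and practical search strategies are covered in Chapter 6; the block size and search range here are illustrative assumptions):

    import numpy as np

    def motion_compensate_block(current, reference, x, y, size=16, search=7):
        # Find the offset (dx, dy) within +/- search that minimises the SAD
        # between the current block and the reference frame, then form the
        # prediction and the residual for that block.
        block = current[y:y + size, x:x + size].astype(int)
        best, best_sad = (0, 0), None
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                ry, rx = y + dy, x + dx
                if ry < 0 or rx < 0 or ry + size > reference.shape[0] \
                        or rx + size > reference.shape[1]:
                    continue
                cand = reference[ry:ry + size, rx:rx + size].astype(int)
                sad = np.abs(block - cand).sum()
                if best_sad is None or sad < best_sad:
                    best, best_sad = (dx, dy), sad
        dx, dy = best
        prediction = reference[y + dy:y + dy + size, x + dx:x + dx + size].astype(int)
        return best, block - prediction     # motion vector and residual block

    ref = np.random.default_rng(1).integers(0, 256, size=(64, 64))
    cur = np.roll(ref, (2, 3), axis=(0, 1))           # reference shifted by a known amount
    mv, residual = motion_compensate_block(cur, ref, 16, 16)
    print(mv, np.abs(residual).sum())                 # recovers (-3, -2) with zero residual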
3.2.4 Model-based Coding

The encoder attempts to create a semantic model of the video scene, for example by analysing and interpreting the content of the scene. An example is a talking head model: the encoder analyses a scene containing a person's head and shoulders (a typical video conferencing scene) and models the head as a 3-D object. The decoder maintains its own 3-D model of the head. Instead of transmitting information that describes the entire image, the encoder sends only the animation parameters required to move the model, together with an error signal that compensates for the difference between the modelled scene and the actual video scene (Figure 3.7). Model-based coding has the potential for far greater compression than the other source models described here, but it is only suitable for certain types of scene and is computationally complex.
[Figure 3.7: Model-based coding: the original scene is analysed to produce a 3-D model; the decoder uses its own 3-D model to reconstruct the scene.]
Block transforms
The spatial imagesamplesare processed in discrete blocks, typically 8 x 8 or 16 x 16
samples. Each block is transformed using a 2-D transform to produce a block of transform
coefficients. The performance of a block-based transform for imagecompression depends on
how well it can decorrelate the information in each block.
The Karhunen-Loeve transform (KLT) has the best performance of any block-based
image transform. The coefficients produced by the KLT are decorrelated and the energy in
the block is packed into a minimal number of coefficients. The KLT is, however, very
computationally inefficient, and it is impractical because the functions required to carry out
the transform (basis functions) must be calculated in advance and transmitted to the
decoder for every image. The discrete cosine transform (DCT) performs nearly as well as
the KLT and is much more computationally efficient.

[Figure 3.8: image CODEC. Encoder: transform, quantise, reorder, entropy encode, store/transmit; decoder: entropy decode, reorder, rescale.]

[Figure 3.9: (a) 16 x 16 block of pixels; (b) DCT coefficients]

Figure 3.9 shows a 16 x 16 block of
image samples (a) and the corresponding block of coefficients produced by the DCT (b). In
the original block, the energy is distributed across the 256 samples and the latter are clearly
closely interrelated (correlated). In the coefficient block, the energy is concentrated into a few
significant coefficients (at the top left). The coefficients are decorrelated: this means that the
smaller-valued coefficients may be discarded (for example by quantisation) without
significantly affecting the quality of the reconstructed image block at the decoder.
The 16 x 16 array of coefficients shown in Figure 3.9 represents spatial frequencies in the
original block. At the top left of the array are the low-frequency components, representing
the gradual changes of brightness (luminance) in the original block. At the bottom right of
the array are high-frequency components and these represent rapid changes in brightness.
These frequency components are analogous to the components produced by Fourier analysis
of a time-varying signal (and in fact the DCT is closely related to the discrete Fourier
transform) except that here the components are 2-D. The example shown in Figure 3.9 is
typical for a photographic image: most of the coefficients produced by the DCT are
insignificant and can be discarded. This makes the DCT a powerful tool for image and
video compression.
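A small illustrative sketch (not from the book) of the energy compaction that makes the DCT useful: it builds the orthonormal DCT-II basis directly, transforms an arbitrary smooth 8 x 8 test block and measures how much of the block energy lands in a handful of coefficients.

import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (rows are basis functions)."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def dct2(block):
    """2-D DCT of a square block: transform rows and columns."""
    c = dct_matrix(block.shape[0])
    return c @ block @ c.T

# Arbitrary smoothly varying 8 x 8 test block (typical of photographic content)
x, y = np.meshgrid(np.arange(8), np.arange(8))
block = 128 + 40 * np.sin(x / 3.0) + 30 * np.cos(y / 4.0)

coeffs = dct2(block - 128)                     # level shift, then transform
energy = np.sort(coeffs.ravel() ** 2)[::-1]
print("fraction of energy in the 4 largest coefficients:",
      energy[:4].sum() / energy.sum())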
Image transforms

The DCT is usually applied to small, discrete blocks of an image, for reasons of practicality.
In contrast, an image transform may be applied to a complete video image (or to a large tile
within the image). The most popular transform of this type is the discrete wavelet transform.
A 2-D wavelet transform is applied to the original image in order to decompose it into a
series of filtered sub-band images (Figure 3.10). Image (a) is processed in a series of stages
to produce the wavelet decomposition image (b). This is made up of a series of components,
each containing a subset of the spatial frequencies in the image. At the top left is a low-pass
filtered version of the original and, moving to the bottom right, each component contains
progressively higher-frequency information that adds detail to the image. It is clear that
the higher-frequency components are relatively sparse, i.e. many of the values (or
coefficients) in these components are zero or insignificant. The wavelet transform is thus
an efficient way of decorrelating or concentrating the important information into a few
significant coefficients.

The wavelet transform is particularly effective for still image compression and has been
adopted as part of the JPEG-2000 standard and for still image texture coding in the MPEG-4
standard. Wavelet-based compression is discussed further in Chapter 7.
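For illustration only (not from the book), a single level of a 2-D Haar wavelet decomposition can be written directly in NumPy; it splits an image with even dimensions into a low-pass sub-band and three detail sub-bands, which for natural images are typically sparse.

import numpy as np

def haar_decompose(img):
    """One level of a 2-D Haar wavelet transform: returns (LL, LH, HL, HH) sub-bands."""
    img = img.astype(float)
    # Horizontal filtering: average and difference of column pairs
    lo = (img[:, 0::2] + img[:, 1::2]) / 2.0
    hi = (img[:, 0::2] - img[:, 1::2]) / 2.0
    # Vertical filtering of each result: average and difference of row pairs
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0   # low-pass approximation
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0   # horizontal detail
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0   # vertical detail
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0   # diagonal detail
    return ll, lh, hl, hh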
Another image transform that has received much attention is the so-called fractal
transform. A fractal transform coder attempts to represent an image as a set of scaled and
translated arbitrary basis patterns. Fractal-based coding has not, however, shown sufficiently
good performance to be included in any of the international standards for video and
image coding and so we will not discuss it in detail.
3.3.2 Quantisation
The block and image transforms described above do not themselves achieve any compression. Instead, they represent the image in a different domain in which the image data is
36
(b)
Figure 3.10 Waveletdecomposition of image
IMAGE CODEC
37
separated into components of varying importance to the appearance of the image. The
purpose of quantisation is to remove the components of the transformed data that are
unimportant to the visual appearance of the image and to retain the visually important
components. Once removed, the less important components cannot be replaced and so
quantisation is a lossy process.
Example
1. The DCT coefficients shown earlier in Figure 3.9 are quantised by dividing each
coefficient by an integer. The resulting array of quantised coefficients is shown in
Figure 3.11(a): the large-value coefficients map to non-zero integers and the small-value
coefficients map to zero.

2. Rescaling the quantised array (multiplying each coefficient by the same integer) gives
Figure 3.11(b). The magnitudes of the larger coefficients are similar to the original
coefficients; however, the smaller coefficients (set to zero during quantisation) cannot
be recreated and remain at zero.

3. Applying an inverse DCT to the rescaled array gives the block of image samples shown
in Figure 3.12: this looks superficially similar to the original image block but some of
the information has been lost through quantisation.
It is possible to vary the coarseness of the quantisation process (using a quantiser scale
factor or step size). Coarse quantisation will tend to discard most of the coefficients,
leaving only the most significant, whereas fine quantisation will tend to leave more
coefficients in the quantised block. Coarse quantisation usually gives higher compression at
the expense of a greater loss in image quality. The quantiser scale factor or step size is often
the main parameter used to control image quality and compression in an image or video
CODEC. Figure 3.13 shows a small original image (left) and the effect of compression and
decompression with fine quantisation (middle) and coarse quantisation (right).
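The effect of the quantiser step size can be sketched in a few lines of Python (illustrative coefficient values, not taken from the book): a coarse step zeros out more coefficients and leaves a larger rescaling error.

import numpy as np

def quantise(coeffs, step):
    """Forward quantisation: divide by the step size and round to integers."""
    return np.round(coeffs / step).astype(int)

def rescale(qcoeffs, step):
    """'Inverse' quantisation: multiply back by the step size (precision is lost)."""
    return qcoeffs * step

coeffs = np.array([[620, -74, 12, 3],
                   [ 55,  18, -4, 1],
                   [ -9,   2,  1, 0],
                   [  3,  -1,  0, 0]], dtype=float)

for step in (4, 24):   # fine vs coarse quantiser step size
    q = quantise(coeffs, step)
    print(f"step {step}: {np.count_nonzero(q)} non-zero coefficients,",
          "max rescaling error", np.abs(rescale(q, step) - coeffs).max())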
3.3.3 Entropy Coding

The quantised block is coded further in three steps:

1. Reordering. The quantised coefficients are reordered (typically with a zigzag scan from
low to high frequency) so that the non-zero coefficients are grouped together and followed
by runs of zeros.

2. Run-level coding. The reordered coefficients are represented as a series of (run, level)
pairs: the first number of each (run, level) pair represents the number of preceding zeros and
the second number represents a non-zero value (level). For example, (5, 12) represents five
zeros followed by 12. (A short sketch of zigzag reordering and run-level coding follows this
list.)
3. Entropy coding. A statistical coding algorithm is applied to the (run, level) data. The
purpose of the entropy coding algorithm is to represent frequently occurring (run, level)
pairs with a short code and infrequently occurring (run, level)
pairs with a longer code.In
this way, the run-level data may be compressed into a small number of bits.
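As referenced above, here is a minimal illustrative Python sketch of zigzag reordering followed by run-level coding (the handling of trailing zeros is an assumption, not the exact scheme used by any particular standard):

import numpy as np

def zigzag_indices(n=8):
    """Generate (row, col) pairs in zigzag order for an n x n block."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_level(block):
    """Convert a quantised block into (run, level) pairs; trailing zeros are dropped."""
    scan = [block[r][c] for r, c in zigzag_indices(len(block))]
    pairs, run = [], 0
    for value in scan:
        if value == 0:
            run += 1            # count preceding zeros
        else:
            pairs.append((run, value))
            run = 0
    return pairs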
Huffman coding and arithmetic coding are widely used for entropy coding of image and
video data.
Huffman coding replaces each symbol (e.g. a [run, level] pair) with a codeword containing
a variable number of bits. The codewords are allocated
based on the statistical distribution of
the symbols. Short codewords are allocated to common symbols and longer codewords are
allocated to infrequent symbols. Each codeword is chosen to be uniquely decodeable, so
that a decoder can extract the series of variable-length codewords without ambiguity.
Huffman coding is well suited to practical implementation and is widely used in practice.
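A compact illustration of the principle (not the actual code tables used by any standard): the classic merge-the-two-least-frequent construction assigns short codewords to common symbols, here labelled as (run, level) pairs with assumed frequencies.

import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman codeword table {symbol: bitstring} from symbol frequencies."""
    tick = count()   # tie-breaker so heap entries are always comparable
    heap = [(f, next(tick), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)   # two least frequent subtrees...
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c0.items()}
        merged.update({s: "1" + code for s, code in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tick), merged))   # ...merged into one
    return heap[0][2]

# Frequently occurring (run, level) pairs receive the shortest codewords
table = huffman_code({(0, 1): 40, (0, 2): 20, (1, 1): 15, (2, 1): 10, (0, 3): 8, "EOB": 7})
print(table)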
Arithmetic coding maps a series of symbols to a fractional number (see Chapter 8) that is
then converted into a binary number and transmitted. Arithmetic coding
has the potential for
higher compression than Huffman coding. Each symbol may be represented with a fractional
number of bits (rather than just an integral number of bits) and this means that the bits
allocated per symbol may be more accurately matched to the statistical distribution of the
coded data.
3.3.4 Decoding
The output of the entropy encoder is a sequence of binary codes representing the original image
in compressed form. In order to recreate the image it is necessary to decode this sequence
and the decoding process (shown in Figure
3.8) is almost the reverse
of the encoding process.
An entropy decoder extracts run-level symbols from the bit sequence. These are converted
to a sequence of coefficients that are reordered into a block of quantised coefficients. The
decoding operations up to this point are the inverse of the equivalent encoding operations.
Each coefficient is multiplied by the integer scale factor (rescaled). This is often described
as inverse quantisation, but in fact the loss of precision due to quantisation cannot be
reversed and so the rescaled coefficients are not identical to the original transform
coefficients.

[Figure: example of reordered coefficient data: 24, 3, -9, 0, -2, 0, 0, 0, 0, 0, 12, 0, 0, 0, 2, ...]
The rescaled coefficients are transformed with an inverse transform to reconstruct a
decoded image. Because of the data loss during quantisation, this image will not be identical
to the original image: the amount of difference depends partly on the coarseness of
quantisation.
3.4 VIDEO CODEC

[Figure 3.15: video CODEC with prediction: the encoder and the decoder each create a prediction from previously coded frame(s)]
3.4.1 Frame Differencing

The simplest predictor is just the previous transmitted frame. Figure 3.16 shows the residual
frame produced by subtracting the previous frame from the current frame in a video
sequence. Mid-grey areas of the residual frame contain zero data: light and dark areas
indicate positive and negative residual data respectively. It is clear that much of the residual
data is zero: hence, compression efficiency can be improved by compressing the residual
frame rather than the current frame.
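A frame difference is a single subtraction per sample; the short sketch below (with synthetic frames, purely illustrative) also counts how many residual samples are non-zero.

import numpy as np

rng = np.random.default_rng(0)
previous = rng.integers(0, 256, size=(144, 176)).astype(np.int16)   # QCIF-sized test frame
current = previous.copy()
current[40:80, 60:100] += 10       # small local change between the two frames

residual = current - previous      # frame difference (signed values)
print("non-zero residual samples:",
      np.count_nonzero(residual), "of", residual.size)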
Table 3.4   Prediction drift

Encoder input      Encoder prediction   Encoder output /      Decoder prediction   Decoder output
                                        decoder input
Original frame 1   Zero                 Compressed frame 1    Zero                 Decoded frame 1
Original frame 2   Original frame 1     Compressed residual   Decoded frame 1      Decoded frame 2
                                        frame 2
Original frame 3   Original frame 2     Compressed residual   Decoded frame 2      Decoded frame 3
                                        frame 3
...
[Figure 3.17: video encoder with decoding loop: a prediction created from previously reconstructed frame(s) is subtracted from the current frame; the residual is coded with an image encoder and also decoded locally (image decoder) to update the stored reference frame(s)]
The decoder faces a potential problem that can be illustrated as follows. Table 3.4 shows
the sequence of operations required to encode and decode a series of video frames using
frame differencing. For the first frame the encoder and decoder use no prediction. The
problem starts with frame 2: the encoder uses the original frame 1 as a prediction and
encodes the resulting residual. However, the decoder only has the decoded frame 1 available
to form the prediction. Because the coding process is lossy, there is a difference between the
decoded and original frame 1 which leads to a small error in the prediction of frame 2 at
the decoder. This error will build up with each successive frame and the encoder and decoder
predictors will rapidly drift apart, leading to a significant drop in decoded quality.
The solution to this problem is for the encoder to use a decoded frame to form the
prediction. Hence the encoder in the above example decodes (or reconstructs) frame 1 to
form a prediction for frame 2. The encoder and decoder use the same prediction and drift
should be reduced or removed. Figure 3.17 shows the complete encoder which now includes
a decoding loop in order to reconstruct its prediction reference. The reconstructed (or
reference) frame is stored in the encoder and in the decoder to form the prediction for the
next coded frame.
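The effect described above can be reproduced with a toy experiment (illustrative Python, synthetic 1-D 'frames', arbitrary quantiser step): coding the residual against the original previous frame lets the decoder drift, whereas predicting from the locally reconstructed frame keeps the error bounded.

import numpy as np

def code_sequence(frames, q, use_reconstructed_reference):
    """Frame-difference coding with quantised residuals; returns per-frame decoder error."""
    enc_ref = np.zeros_like(frames[0], dtype=float)   # encoder's prediction reference
    dec_ref = np.zeros_like(frames[0], dtype=float)   # decoder's prediction reference
    errors = []
    for frame in frames:
        residual = frame - enc_ref
        q_residual = np.round(residual / q) * q        # lossy coding of the residual
        decoded = dec_ref + q_residual                 # decoder adds residual to its reference
        dec_ref = decoded
        # Encoder reference: either the original frame (drifts) or its own reconstruction
        enc_ref = dec_ref.copy() if use_reconstructed_reference else frame.astype(float)
        errors.append(np.abs(decoded - frame).mean())
    return errors

rng = np.random.default_rng(1)
base = rng.integers(0, 256, 64).astype(float)
frames = [base + 2.0 * i for i in range(10)]   # slowly brightening synthetic frames

print("drifting encoder   :", [round(e, 2) for e in code_sequence(frames, 8, False)])
print("with decoding loop :", [round(e, 2) for e in code_sequence(frames, 8, True)])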
3.4.2 Motion-compensated Prediction
Frame differencing gives better compression performance than intra-frame coding when
successive frames are very similar, but does not perform well when there is a significant
change between the previous and current frames. Such changes are usually due to movement
in the video scene and a significantly better prediction can be achieved by estimating this
movement and compensating for it.
Figure 3.18 shows a video CODEC that uses motion-compensated prediction. Two new
steps are required in the encoder:
1. Motion estimation: a region of the current frame (often a rectangular block of luminance
samples) is compared with neighbouring regions of the previous reconstructed frame.
2. Motion compensation: the region of the reference frame that gives the 'best' match is used
to form a motion-compensated prediction of the current region, and the offset between the
current region and this matching region (the motion vector) is transmitted to the decoder.

[Figure 3.18: video CODEC with motion-compensated prediction]

This improvement in compression does not come without a price: motion estimation can be
very computationally intensive. The design of a motion estimation algorithm can have a
dramatic effect on the compression performance and computational complexity of a video
CODEC.
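A minimal full-search block-matching sketch in Python (illustrative assumptions: exhaustive search over a +/-7 sample window, sum of absolute differences as the matching criterion):

import numpy as np

def block_match(current, reference, bx, by, block=16, search=7):
    """Full-search motion estimation for one block; returns (dy, dx) and the best SAD."""
    target = current[by:by + block, bx:bx + block].astype(int)
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > reference.shape[0] or x + block > reference.shape[1]:
                continue   # candidate block lies outside the reference frame
            candidate = reference[y:y + block, x:x + block].astype(int)
            sad = np.abs(target - candidate).sum()   # sum of absolute differences
            if best_sad is None or sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

Practical encoders usually replace this exhaustive search with faster, sub-optimal search patterns to reduce the computational cost.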
3.4.4 Decoding

A motion-compensated decoder (Figure 3.18) is usually simpler than the corresponding
encoder. The decoder does not need a motion estimation function (since the motion
information is transmitted in the coded bit stream) and it contains only a decoding path
(compared with the encoding and decoding paths in the encoder).
3.5 SUMMARY
Efficient coding of images and video sequences involves creating a model of the source data
that converts it into a form that can be compressed. Most image and video CODECs
developed over the last two decades have been based around a common set of building
blocks. For motion video compression, the first step is to create a motion-compensated
prediction of the frame to be compressed, based on one or more previously transmitted
frames. The difference between this model and the actual input frame is then coded using an
image CODEC. The data is transformed into another domain (e.g. the DCT or wavelet
domain), quantised, reordered and compressed using an entropy encoder. A decoder must
reverse these steps to reconstruct the frame: however, quantisation cannot be reversed and so
the decoded frame is an imperfect copy of the original.

An encoder and decoder must clearly use a compatible set of algorithms in order to
successfully exchange compressed image or video data. Of prime importance is the syntax or
structure of the compressed data. In the past 15 years there has been a significant worldwide
effort to develop standards for video and image compression. These standards generally
describe a syntax (and a decoding process) to support video or image communications for a
wide range of applications. Chapters 4 and 5 provide an overview of the main standards
bodies and the JPEG, MPEG and H.26x video and image coding standards.
In order to provide the maximum flexibility and scope for innovation, the standards do not
define a video or image encoder: this is left to the designer's discretion. However, in practice
the syntax elements and reference decoder limit the scope for alternative designs that still
meet the requirements of the standard.
4.2.1 The Expert Groups
The most important developments in video coding standards have been due to two
international standards bodies: the ITU (formerly the CCITT) and the ISO. The ITU has
concentrated on standards to support real-time, two-way video communications. The group
responsible for developing these standards is known as VCEG (Video Coding Experts
Group) and has issued:

• H.261 (1990): Video telephony over constant bit-rate channels, primarily aimed at ISDN
channels of p x 64 kbps.

• H.263 (1995): Video telephony over circuit- and packet-switched networks, supporting a
range of channels from low bit rates (20-30 kbps) to high bit rates (several Mbps).
The H.26x series of standards will be described in Chapter 5. In parallel with the ITU's
activities, the ISO has issued standards to support storage and distribution applications. The
two relevant groups are JPEG (Joint Photographic Experts Group) and MPEG (Moving
Picture Experts Group) and they have been responsible for:

• the JPEG and JPEG-2000 still image coding standards (Section 4.3);

• the MPEG-1, MPEG-2 and MPEG-4 video coding standards (Section 4.4).
Since releasing Version 1 of MPEG-4, the MPEG committee has concentrated on framework standards that are not primarily concerned with video coding:
• MPEG-21: Multimedia Framework. The MPEG-21 initiative looks beyond coding and
indexing to the complete multimedia content delivery chain, from creation through
production and delivery to consumption (e.g.viewing the content). MPEG-21 will
definekey elements of this delivery framework, including content description and
identification, content handling, intellectual property management, terminal and network
interoperation and content representation. The motivation behind MPEG-21 is to encourage integration and interoperation between the diverse technologies that are required to
create, deliver and decode multimedia data. Work on the proposed standard started in
June 2000.
Figure 4.1 shows the relationship between the standards bodies, the expert groups and the
video coding standards. The expert groups have addressed different application areas (still
images, video conferencing, entertainment and multimedia), but in practice there are many
overlaps between the applications of the standards. For example, a version of JPEG, Motion
JPEG, is widely used for video conferencing and video surveillance; MPEG-1 and MPEG-2
have been used for video conferencing applications; and the core algorithms of MPEG-4 and
H.263 are identical.
In recognition of these natural overlaps, the expert groups have cooperated at several
stages and the result of this cooperation has led to outcomes such as the ratification of
MPEG-2 (Video) as ITU standard H.262 and the incorporation of baseline H.263 into
MPEG-4 (Video). There is also interworking between the VCEG and MPEG committees and
other related bodies such as the Internet Engineering Task Force (IETF), industry groups
(such as the Digital Audio Visual Interoperability Council, DAVIC) and other groups within
ITU and ISO.
4.2.2 The Standardisation Process
The development of an international standard for image or video coding is typically an
involved process:
1. The scope and aims of the standard are defined. For example, the emerging H.26L
standard is designed with real-time video communications applications in mind and aims
to improve performance over the preceding H.263 standard.
2. Potential technologies for meeting these aims are evaluated, typically by competitive
testing. The test scenario and criteria are defined and interested parties are encouraged to
participate and demonstrate the performance of their proposed solutions. The 'best'
technology is chosen based on criteria such as coding performance and implementation
complexity.
3. The chosen technology is implemented as a test model. This is usually a software
implementation that is made available to members of the expert group for experimentation, together with a test model document that describes its operation.
4. The test model is developed further: improvements and features are proposed and
demonstrated by members of the expert group and the best of these developments are
integrated into the test model.
5. At a certain point (depending on the timescales of the standardisation effort and on
whether the aims of the standard have been sufficiently met by the test model), the model
is 'frozen' and the test model document forms the basis of a draft standard.
6. The draft standard is reviewed and, following any final revisions and formal approval, is
published as an international standard.
Officially, the standard is not available in the public domain until the final stage of approval
and publication. However, because of the fast-moving nature of the video communications
industry, draft documents and test models can be very useful for developers and manufacturers.
Many of the ITU VCEG documents and models are available via public FTP. Most
of the MPEG working documents are restricted to members of MPEG itself, but a number of
overview documents are available at the MPEG website. Information and links about JPEG
and MPEG are also available. Keeping in touch with the latest developments and gaining
access to draft standards are powerful reasons for companies and organisations to become
involved with the MPEG, JPEG and VCEG committees.
4.2.3 Understanding and Using the Standards
Published ITU and ISO standards may be purchased from the relevant standards body. For
developers of standards-compliant video coding systems, the published standard is an
essential point of reference as it defines the syntax and capabilities that a video CODEC
must conform to in order to successfully interwork with other systems. However, the
standards themselves are not an ideal introduction to the concepts and techniques of video
coding: the aim of the standard is to define the syntax as explicitly and unambiguously as
possible and this does not make for easy reading.
Furthermore, the standards do not necessarily indicate practical constraints that a designer
must take into account. Practical issues and good design techniques are deliberately left to
the discretion of manufacturers in order to encourage innovation and competition, and so
other sources are a much better guide to practical design issues. This book aims to collect
together information and guidelines for designers and integrators; other texts that may be
useful for developers are listed in the bibliography.
The test models produced by the expert groups are designed to facilitate experimentation
and comparison of alternative techniques, and the test model (a software model with an
accompanying document) can provide a valuable insight into the implementation of the
standard. Further documents such as implementation guides (e.g. H.263 Appendix III) are
produced by the expert groups to assist with the interpretation of the standards for practical
applications.
In recent years the standards bodies have recognised the need to direct developers towards
certain subsets of the tools and options available within the standard. For example, H.263
now has a total of 19 optional modes and it is unlikely that any particular application would
need to implement all of these modes. This has led to the concept of profiles and levels. A
profile describes a subset of functionalities that may be suitable for a particular application
and a level describes a subset of operating resolutions (such as frame resolution and frame
rates) for certain applications.
4.3 JPEG (JOINT PHOTOGRAPHIC EXPERTS GROUP)
4.3.1 JPEG
International standard ISO 10918 [3] is popularly known by the acronym of the group that
developed it, the Joint Photographic Experts Group. Released in 1992, it provides a method
and syntax for compressing continuous-tone still images (such as photographs). Its main
application is storage and transmission of still images in a compressed form, and it is widely
used in digital imaging, digital cameras, embedding images in web pages, and many more
applications. Whilst aimed at still image compression, JPEG has found some popularity as a
simple and effective method of compressing moving images (in the form of Motion JPEG).
The JPEG standard defines a syntax and decoding process for a baseline CODEC and this
includes a set of features that are designed to suit a wide range of applications. Further
optional modes are defined that extend the capabilities of the baseline CODEC.
[Figure 4.2]

The image is processed in 8 x 8 blocks, and the colour components
may be processed separately (one complete component at a time) or in interleaved order
(e.g. a block from each of the three colour components in succession). Each block is coded
using the following steps.
Level shift   Input data is shifted so that it is distributed about zero: e.g. an 8-bit input
sample in the range 0 to 255 is shifted to the range -128 to 127 by subtracting 128.
Forward DCT An 8 x 8 block transform, described in Chapter 7.
Quantiser   Each of the 64 DCT coefficients Cij is quantised by integer division:

    Cqij = round(Cij / Qij)

where Qij is the corresponding weighting value from a quantisation map.

[Figure 4.3: JPEG quantisation map, with low frequencies at the top left and high frequencies at the bottom right]

Figure 4.3
gives an example of a quantisation map: the weighting means that the visually important
lower frequencies (to the top left of the map) are preserved and the less important higher
frequencies (to the bottom right) are more highly compressed.
The DC coefficient of each block is coded differentially: the DC coefficient of the previous
coded block is used as a prediction DCpred, and the difference between the two is coded and
transmitted rather than the actual coefficient DCcur. The quantised AC coefficients are
reordered and coded as variable-length codes, each representing a run of zeros (RRRR), the
size category of the non-zero value that ends the run (SSSS) and the value itself.

Example

A run of six zeros followed by the value +5 would be coded as:

[RRRR=6] [SSSS=3] [Value=+5]
Marker insertion   Marker codes are inserted into the entropy-coded data sequence.
Examples of markers include the frame header (describing the parameters of the frame
such as width, height and number of colour components), scan headers (see below) and
restart interval markers (enabling a decoder to resynchronise with the coded sequence if an
error occurs).
The result of the encoding process is a compressed sequence of bits, representing the image
data, that may be transmitted or stored. In order to view the image, it must be decoded by
reversing the above steps, starting with marker detection and entropy decoding and ending
with an inverse DCT. Because quantisation is not a reversible process (as discussed in
Chapter 3), the decoded image is not identical to the original image.
Lossless JPEG

JPEG also defines a lossless encoding/decoding algorithm that uses DPCM (described in
Chapter 3). Each pixel is predicted from up to three neighbouring pixels and the difference
between the actual and predicted values is entropy coded and transmitted. Lossless JPEG
guarantees image fidelity at the expense of relatively poor compression performance.
Optional modes

Progressive encoding involves encoding the image in a series of progressive scans. The
first scan may be decoded to provide a coarse representation of the image; decoding each
subsequent scan progressively improves the quality of the image until the final quality is
reached. This can be useful when, for example, a compressed image takes a long time to
transmit: the decoder can quickly recreate an approximate image which is then further
refined in a series of passes. Two versions of progressive encoding are supported: spectral
selection, where each scan consists of a subset of the DCT coefficients of every block (e.g.
(a) DC only; (b) low-frequency AC; (c) high-frequency AC coefficients), and successive
approximation, where the first scan contains the N most significant bits of each coefficient
and later scans contain the less significant bits. Figure 4.4 shows an image encoded and
decoded using progressive spectral selection. The first image contains the DC coefficients of
each block, the second image contains the DC and two lowest AC coefficients and the third
contains all 64 coefficients in each block.
[Figure 4.4: Progressive encoding example (spectral selection): (a) DC coefficients only; (b) DC and two lowest AC coefficients; (c) all coefficients]
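As a rough illustration (not from the book), spectral selection can be mimicked by splitting each block's zigzag-ordered DCT coefficients into scans; the split points below are arbitrary, and the zigzag_indices helper from the earlier sketch is assumed to be available.

import numpy as np

def spectral_selection_scans(coeff_block, splits=(1, 6, 64)):
    """Split one block's coefficients into progressive scans (e.g. DC, low AC, high AC)."""
    order = zigzag_indices(coeff_block.shape[0])
    scans, start = [], 0
    for end in splits:
        scan = np.zeros_like(coeff_block)
        for r, c in order[start:end]:
            scan[r, c] = coeff_block[r, c]   # coefficients carried by this scan only
        scans.append(scan)
        start = end
    return scans   # a decoder sums the scans received so far, then inverse-transforms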
The two progressive encoding modes and the hierarchical encoding mode can be thought
of as scalable coding modes. Scalable coding will be discussed further in the section on
MPEG-2.
4.3.2 Motion JPEG

A Motion JPEG or MJPEG CODEC codes a video sequence as a series of JPEG images,
each corresponding to one frame of video (i.e. a series of intra-coded frames). Originally,
the JPEG standard was not intended to be used in this way: however, MJPEG has
become popular and is used in a number of video communications and storage applications.
No attempt is made to exploit the inherent temporal redundancy in a moving video
sequence and so compression performance is poor compared with inter-frame CODECs (see
Chapter 5, Performance Comparison). However, MJPEG has a number of practical
advantages:

• Low complexity: algorithmic complexity, and requirements for hardware, processing and
storage, are very low compared with even a basic inter-frame CODEC (e.g. H.261).

• Error tolerance: intra-frame coding limits the effect of an error to a single decoded frame
and so is inherently resilient to transmission errors. Until recent developments in error
resilience (see Chapter 11), MJPEG outperformed inter-frame CODECs in noisy
environments.

• Market awareness: JPEG is perhaps the most widely known and used of the compression
standards and so potential users are already familiar with the technology of Motion JPEG.
Because of its poor compression performance, MJPEG is only suitable for high-bandwidth
communications (e.g. over dedicated networks). Perversely, this means that users generally
have a good experience of MJPEG because installations do not tend to suffer from the
bandwidth and delay problems encountered by inter-frame CODECs used over best effort
networks (such as the Internet) or low bit-rate channels. An MJPEG coding integrated
circuit (IC), the Zoran ZR36060, is described in Chapter 12.
4.3.3 JPEG-2000
The original JPEG standard has gained widespread acceptance and is now ubiquitous
throughout computing applications: it is the main format for photographic images on the
world wide web and it is widely used for image storage. However, the block-based DCT
algorithm has a number of disadvantages, perhaps the most important of which is the
blockiness of highly compressed JPEG images (see Chapter 9). Since its release, many
alternative coding schemes have been shown to outperform baseline JPEG. The need for
better performance at high compression ratios led to the development of the JPEG-2000 standard.
The features that JPEG-2000 aims to support are as follows:
• Efficient compression of continuous-tone, bi-level and compound images (e.g. photographic
images with overlaid text: the original JPEG does not handle this type of image well).

• Lossless and lossy compression (within the same compression framework).

• Error resilience tools including data partitioning (see the description of MPEG-2 below),
error detection and concealment (see Chapter 11 for more details).

• Open architecture. The JPEG-2000 standard provides an open framework which should
make it relatively easy to add further coding features either as part of the standard or as a
proprietary add-on to the standard.
The architecture of a JPEG-2000 encoder is shown in Figure 4.5. This is superficially similar
to the JPEG architecture but one important difference is that the same architecture may be
used for lossy or lossless coding.
The basic coding unit of JPEG-2000 is a tile. This is normally a square region of the
image (with dimensions that are typically a power of two), and the image is covered by
non-overlapping identically sized tiles. Each tile is encoded as follows:
Transform: A wavelet transform is carried out on each tile to decompose it into a series of
sub-bands (see Sections 3.3.1 and 7.3). The transform may be reversible (for lossless
coding applications) or irreversible (suitable for lossy coding applications).
Quantisation: The coefficients of the wavelet transform are quantised (as described in
Chapter 3) according to the importance of each sub-band to the final image appearance.
There is an option to leave the coefficients unquantised (lossless coding).
Entropy coding: JPEG-2000 uses a form of arithmetic coding to encode the quantised
coefficients prior to storage or transmission. Arithmetic coding can provide better
compression efficiency than variable-length coding and is described in Chapter 8.
The result is a compression standard that can give significantly better image compression
performance than JPEG. For the same image quality, JPEG-2000 can usually compress
images by at least twice as much as JPEG. At high compression ratios, the quality of images
[Figure 4.5: Architecture of a JPEG-2000 encoder: wavelet transform, quantiser, arithmetic encoder]
degrades gracefully, with the decoded image showing a gradual blurring effect rather than
the more obvious blocking effect associated with the DCT. These performance gains
are achieved at the expense of increased complexity and storage requirements during
encoding and decoding. One effect of this is that images take longer to store and display
using JPEG-2000 (though this should be less of an issue as processors continue to get faster).
4.4 MPEG (MOVING PICTURE EXPERTS GROUP)
4.4.1 MPEG-1

The first standard produced by the Moving Picture Experts Group, popularly known as
MPEG-1, was designed to provide video and audio compression for storage and playback on
CD-ROMs. A CD-ROM played at single speed has a transfer rate of 1.4 Mbps. MPEG-1
aims to compress video and audio to a bit rate of 1.4 Mbps with a quality that is comparable
to VHS videotape. The target market was the video CD, a standard CD containing up to
70 minutes of stored video and audio. The video CD was never a commercial success: the
quality improvement over VHS tape was not sufficient to tempt consumers to replace their
video cassette recorders and the maximum length of 70 minutes created an irritating break in
a feature-length movie. However, MPEG-1 is important for two reasons: it has gained
widespread use in other video storage and transmission applications (including CD-ROM
storage as part of interactive applications and video playback over the Internet), and its
functionality is used and extended in the popular MPEG-2 standard.
The MPEG-1 standard consists of three parts. Part 1 [16] deals with system issues (including
the multiplexing of coded video and audio), Part 2 [4] deals with compressed video and
Part 3 [17] with compressed audio. Part 2 (video) was developed with the aim of supporting
efficient coding of video for CD playback applications and achieving video quality comparable
to, or better than, VHS videotape at CD bit rates (around 1.2 Mbps for video). There was a
requirement to minimise decoding complexity since most consumer applications were
envisaged to involve decoding and playback only, not encoding. Hence MPEG-1 decoding is
considerably simpler than encoding (unlike JPEG, where the encoder and decoder have
similar levels of complexity).
MPEG-1 features

The input video signal to an MPEG-1 video encoder is 4:2:0 Y:Cr:Cb format (see Chapter 2)
with a typical spatial resolution of 352 x 288 or 352 x 240 pixels. Each frame of video is
processed in units of a macroblock, corresponding to a 16 x 16 pixel area in the displayed
frame. This area is made up of 16 x 16 luminance samples, 8 x 8 Cr samples and 8 x 8 Cb
samples (because Cr and Cb have half the horizontal and vertical resolution of the luminance
component). A macroblock consists of six 8 x 8 blocks: four luminance (Y) blocks, one Cr
block and one Cb block (Figure 4.6).
Each frame of video is encoded to produce a coded picture. There are three main
types: I-pictures, P-pictures and B-pictures. (The standard specifies a fourth picture type,
D-pictures, but these are seldom used in practical applications.)
[Figure 4.6: structure of a macroblock: four 8 x 8 luminance blocks, one Cr block and one Cb block]
I-pictures are intra-coded without any motion-compensated prediction (in a similar way
to a baseline JPEG image). An I-picture is used as a reference for further predicted pictures
(P- and B-pictures, described below).

P-pictures are inter-coded using motion-compensated prediction from a reference picture
(the P-picture or I-picture preceding the current P-picture). Hence a P-picture is predicted
using forward prediction and a P-picture may itself be used as a reference for further
predicted pictures (P- and B-pictures).
B-pictures are inter-coded using motion-compensated prediction from two reference
pictures, the P- and/or I-pictures before and after the current B-picture. Two motion vectors
are generated for each macroblock in a B-picture (Figure 4.7): one pointing to a matching
area in the previous reference picture (a forward vector) and one pointing to a matching area
in the future reference picture (a backward vector).

[Figure 4.7: forward and backward prediction references for a macroblock in a B-picture]

[Figure 4.8: a typical sequence of I-, B- and P-pictures]

A motion-compensated prediction
macroblock can be formed in three ways: forward prediction using the forward vector,
backward prediction using the backward vector, or bidirectional prediction (where
the prediction reference is formed by averaging the forward and backward prediction
references). Typically, an encoder chooses the prediction mode (forward, backward or
bidirectional) that gives the lowest energy in the difference macroblock. B-pictures are not
themselves used as prediction references for any further predicted frames.
Figure 4.8 shows a typical series of I-, B- and P-pictures. In order to encode a B-picture,
two neighbouring I- or P-pictures (anchor pictures or key pictures) must be processed and
stored in the prediction memory, introducing a delay of several frames into the encoding
procedure. Before frame B2 in Figure 4.8 can be encoded, its two anchor frames I1 and P4
must be processed and stored, i.e. frames 1-4 must be processed before frames 2 and 3 can
be coded. In this example, there is a delay of at least three frames during encoding (frames
2, 3 and 4 must be stored before B2 can be coded) and this delay will be larger if more
B-pictures are used.
In order to limit the delay at the decoder, encoded pictures are reordered before
transmission, such that all the anchor pictures required to decode a B-picture are placed
before the B-picture. Figure 4.9 shows the same series of frames, reordered prior to
transmission. P4 is now placed before B2 and B3. Decoding proceeds as shown in Table
4.1: P4 is decoded immediately after I1 and is stored by the decoder. B2 and B3 can now be
decoded and displayed (because their prediction references, I1 and P4, are both available),
after which P4 is displayed. There is at most one frame delay between decoding and display
and the decoder only needs to store two decoded frames. This is one example of
asymmetry between encoder and decoder: the delay and storage in the decoder are
significantly lower than in the encoder.
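The reordering rule (every B-picture is transmitted after both of its anchor pictures) can be expressed in a few lines of Python; the picture labels below are illustrative and follow the pattern of Figure 4.8.

def transmission_order(display_order):
    """Reorder pictures so every B-picture is sent after both of its anchor pictures."""
    out, pending_b = [], []
    for pic in display_order:          # pictures labelled e.g. 'I1', 'B2', 'B3', 'P4', ...
        if pic.startswith('B'):
            pending_b.append(pic)      # hold B-pictures until the next anchor is sent
        else:                          # I- or P-picture (an anchor)
            out.append(pic)
            out.extend(pending_b)
            pending_b = []
    return out + pending_b

print(transmission_order(['I1', 'B2', 'B3', 'P4', 'B5', 'B6', 'P7', 'B8', 'B9', 'I10']))
# -> ['I1', 'P4', 'B2', 'B3', 'P7', 'B5', 'B6', 'I10', 'B8', 'B9']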
[Table 4.1: decoding and display order of the reordered sequence]
I-pictures are useful resynchronisation points in the coded bit stream: because it is coded
without prediction, an I-picture may be decoded independently of any other coded pictures.
This supports random access by a decoder (a decoder may start decoding the bit stream at any
I-picture position) and error resilience (discussed in Chapter 11). However, an I-picture has
poor compression efficiency because no temporal prediction is used. P-pictures provide
better compression efficiency due to motion-compensated prediction and can be used as
prediction references. B-pictures have the highest compression efficiency of the three
picture types.
The MPEG-1 standard does not actually define the design of an encoder: instead, the
standard describes the coded syntax and a hypothetical reference decoder. In practice, the
syntax and functionality described by the standard mean that a compliant encoder has to
contain certain functions. The basic CODEC is similar to Figure 3.18. A front end carries
out motion estimation and compensation based on one reference frame (P-pictures) or two
reference frames (B-pictures). The motion-compensated residual (or the original picture data
in the case of an I-picture) is encoded using DCT, quantisation, run-level coding and
variable-length coding. In an I- or P-picture, quantised transform coefficients are rescaled
and transformed with the inverse DCT to produce a stored reference frame for further
predicted P- or B-pictures. In the decoder, the coded data is entropy decoded, rescaled,
inverse transformed and motion compensated. The most complex part of the CODEC is
often the motion estimator because bidirectional motion estimation is computationally
intensive. Motion estimation is only required in the encoder and this is another example
of asymmetry between the encoder and decoder.
MPEG-1 syntax

The syntax of an MPEG-1 coded video sequence forms a hierarchy as shown in Figure 4.10.
The levels or layers of the hierarchy are as follows.

[Figure 4.10: MPEG-1 syntax hierarchy: sequence, group of pictures, picture, slice, macroblock, block]
GOP layer A GOP is one I-picture followed by a series of P- and B-pictures (e.g. Figure
4.8). In Figure 4.8, the GOP contains nine pictures (one I, two P and six B) but many other
GOP structures are possible, for example:
(a) All GOPs contain just one I-picture, i.e. no motion compensated prediction is used: this
is similar to Motion JPEG.
(b) GOPs contain only I- and P-pictures, i.e. no bidirectional prediction is used: compression
efficiency is relatively poor but complexity is low (since B-pictures are more complex to
generate).

(c) Large GOPs: the proportion of I-pictures in the coded stream is low and hence
compression efficiency is high. However, there are few synchronisation points, which
may not be ideal for random access and for error resilience.

(d) Small GOPs: there is a high proportion of I-pictures and so compression efficiency is
low; however, there are frequent opportunities for resynchronisation.
An encoder need not keep a consistent GOP structure within a sequence. It may be useful to
vary the structure occasionally, for example by starting a new GOP when a scene change or
cut occurs in the video sequence.
Picture layer   A picture defines a single coded frame. The picture header describes the
type of coded picture (I, P, B) and a temporal reference that defines when the picture should
be displayed in relation to the other pictures in the sequence.
Slice layer   A picture is made up of a number of slices, each of which contains an
integral number of macroblocks. In MPEG-1 there is no restriction on the size or
arrangement of slices in a picture, except that slices should cover the picture in raster order.
Figure 4.11 shows one possible arrangement: each shaded region in this figure is a single
slice.

A slice starts with a slice header that defines its position. Each slice may be decoded
independently of other slices within the picture and this helps the decoder to recover from
transmission errors: if an error occurs within a slice, the decoder can always restart decoding
from the next slice header.
Macroblock layer   A slice is made up of an integral number of macroblocks, each of
which consists of six blocks (Figure 4.6). The macroblock header describes the type of
macroblock, motion vector(s) and defines which 8 x 8 blocks actually contain coded
transform data. The picture type (I, P or B) defines the default prediction mode for each
macroblock, but individual macroblocks within P- or B-pictures may be intra-coded if
required (i.e. coded without any motion-compensated prediction). This can be useful if no
good match can be found within the search area in the reference frames, since it may be more
efficient to code the macroblock without any prediction.
Block layer A block contains variable-length code(s) that represent the quantised transform coefficients in an 8 x 8 block. Each DC coefficient (DCT coefficient [0, 01) is coded
differentially from the DC coefficient of the previous coded block, to exploit the fact that
neighbouring blocks tend to have very similar DC (average) values. AC coefficients (all
other coefficients) are coded as a (run, level)
pair, where run indicates the number of
preceding zero coefficients and level the value of a non-zero coefficient.
4.4.2 MPEG-2

The next important entertainment application for coded video (after CD-ROM storage) was
digital television. In order to provide an improved alternative to analogue television, several
key features were required of the video coding algorithm. It had to efficiently support larger
frame sizes (typically 720 x 576 or 720 x 480 pixels for ITU-R 601 resolution) and coding
of interlaced video. MPEG-1 was primarily designed to support progressive video, where
each frame is scanned as a single unit in raster order. At television-quality resolutions,
interlaced video (where a frame is made up of two interlaced fields as described in
Chapter 2) gives a smoother video image. Because the two fields are captured at separate
time intervals (typically 1/50 or 1/60 of a second apart), better performance may be achieved
by coding the fields separately.
MPEG-2 consists of three main sections: Video (described below), Audio (based on
MPEG-1 audio coding) and Systems (defining, in more detail than MPEG-1 Systems,
multiplexing and transmission of the coded audio/visual stream). MPEG-2 Video is (almost)
a superset of MPEG-1 Video, i.e. most MPEG-1 video sequences should be decodeable by
an MPEG-2 decoder. The main enhancements added by the MPEG-2 standard are as follows:
Scalability
The progressive modes of JPEG described earlier are forms of scalable coding. A scalable
coded bit stream consists of a number of layers, a base layer and one or more enhancement
layers. The base layer can be decoded to provide a recognisable video sequence that has a
limited visual quality, and a higher-quality sequence may be produced by decoding the base
layer plus enhancement layer(s), with each extra enhancement layer improving the quality of
the decoded sequence. MPEG-2 video supports four scalable modes.
MPEG-2 defines a number of profiles, each supporting a particular subset of the coding
tools:
• Simple: 4:2:0 sampling, only I- and P-pictures are allowed. Complexity is kept low at
the expense of poor compression performance.

• Main: this includes all of the core MPEG-2 capabilities including B-pictures and
support for interlaced video. 4:2:0 sampling is used.

• SNR: as Main profile, except that an enhancement layer is added to provide higher
visual quality.

• Spatial: as SNR profile, except that spatial scalability may also be used to provide
higher-quality enhancement layers.

• High: as Spatial profile, with the addition of support for 4:2:2 sampling.
Each profile may be used at one of a number of levels that constrain parameters such as frame
resolution and frame rate: Main level, for example, supports frame sizes up to 720 x 576.
The MPEG-2 standard defines certain recommended combinations of profiles and levels.
Main profile at Low level (using only frame encoding) is essentially MPEG-1. Main profile at
Main level is suitable for broadcast digital television and this is the most widely used profile/
level combination. Main profile at High level is suitable for high-definition television (HDTV).
(Originally, the MPEG working group intended to release a further standard, MPEG-3, to
support coding for HDTV applications. However, once it became clear that the MPEG-2
syntax could deal with this application adequately, work on this standard was dropped and so
there is no MPEG-3 standard.)
In addition to the main features described above, there are some further changes from the
MPEG-1 standard. Slices in an MPEG-2 picture are constrained such that they may not
overlap from one row of macroblocks to the next (unlike MPEG-1 where a slice may occupy
multiple rows of macroblocks). D-pictures in MPEG-1 were felt to be of limited benefit and
are not supported in MPEG-2.
4.4.3 MPEG-4

The MPEG-1 and MPEG-2 standards deal with complete video frames, each coded as a
single unit. The MPEG-4 standard [6] was developed with the aim of extending the
capabilities of the earlier standards in a number of ways.
Support for low bit-rate applications MPEG-1 and MPEG-2 are reasonably efficient for
coded bit rates above around 1 Mbps. However, many emerging applications (particularly
Internet-based applications) require a much lower transmission bit rate and MPEG-1 and 2
do not support efficient compression at low bit rates (tens of kbps or less).
Support for object-based coding   Perhaps the most fundamental shift in the MPEG-4
standard has been towards object-based or content-based coding, where a video scene can be
handled as a set of foreground and background objects rather than just as a series of
rectangular frames. This type of coding opens up a wide range of possibilities, such as
independent coding of different objects in a scene, reuse of scene components, compositing
(where objects from a number of sources are combined into a scene) and a high degree of
interactivity. The basic concept used in MPEG-4 Visual is that of the video object (VO). A
video scene (VS) (a sequence of video frames) is made up of a number of VOs. For example,
the VS shown in Figure 4.14 consists of a background VO and two foreground VOs. MPEG-4
provides tools that enable each VO to be coded independently, opening up a range of new
possibilities. The equivalent of a frame in VO terms, i.e. a snapshot of a VO at a single
instant in time, is a video object plane (VOP). The entire scene may be coded as a single,
rectangular VOP and this is equivalent to a picture in MPEG-1 and MPEG-2 terms.
Toolkit-based coding   MPEG-1 has a very limited degree of flexibility; MPEG-2 introduced
the concept of a toolkit of profiles and levels that could be combined in different
ways for various applications. MPEG-4 extends this towards a highly flexible set of coding
tools that enable a range of applications as well as a standardised framework that allows new
tools to be added to the toolkit.

The MPEG-4 standard is organised so that new coding tools and functionalities may be
added incrementally as new versions of the standard are developed, and so the list of tools
continues to grow. However, the main tools for coding of video images can be summarised
as follows.
Input format   Video data is expected to be pre-processed and converted to one of the
picture sizes listed in Table 4.2, at a frame rate of up to 30 frames per second and in 4:2:0
Y:Cr:Cb format (i.e. the chrominance components have half the horizontal and vertical
resolution of the luminance component).
Picture types   Each frame is coded as an I- or P-frame. An I-frame contains only intra-coded
macroblocks, whereas a P-frame can contain either intra- or inter-coded macroblocks.
Table 4.2   Picture formats

Picture format   Size (luminance)
SubQCIF          128 x 96
QCIF             176 x 144
CIF              352 x 288
4CIF             704 x 576
16CIF            1408 x 1152
Motion estimation and compensation   This is carried out on 16 x 16 macroblocks or
(optionally) on 8 x 8 blocks. Motion vectors can have half-pixel resolution.
Transform coding   The motion-compensated residual is coded with DCT, quantisation,
zigzag scanning and run-level coding.
Variable-length coding   The run-level coded transform coefficients, together with header
information and motion vectors, are coded using variable-length codes. Each non-zero
transform coefficient is coded as a combination of (run, level, last), where last is a flag to
indicate whether this is the last non-zero coefficient in the block (see Chapter 8).
Syntax
The syntax of an MPEG-4 (VLBV) coded bit stream is illustrated in Figure 4.15.

Picture layer   The highest layer of the syntax contains a complete coded picture. The picture
header indicates the picture resolution, the type of coded picture (inter or intra) and includes
a temporal reference field. This indicates the correct display time for the decoder (relative to
other coded pictures) and can help to ensure that a picture is not displayed too early or too late.
[Figure 4.15: MPEG-4/H.263 layered syntax: picture, group of blocks, macroblock, block]

[Figure 4.16: arrangement of GOBs in (a) CIF and (b) QCIF pictures]
Group of blocks layer   A group of blocks (GOB) consists of one complete row of macroblocks
in SubQCIF, QCIF and CIF pictures (two rows in a 4CIF picture and four rows in a 16CIF
picture). GOBs are similar to slices in MPEG-1 and MPEG-2 in that, if an optional GOB
header is inserted in the bit stream, the decoder can resynchronise to the start of the next
GOB if an error occurs. However, the size and layout of each GOB are fixed by the standard
(unlike slices). The arrangement of GOBs in a QCIF and CIF picture is shown in Figure 4.16.
Macroblock layer A macroblock consists of four luminance blocks and two chrominance
blocks. The macroblock header includes information about the type of macroblock, coded
block pattern (indicating which of the six blocks actually contain transform coefficients)
and coded horizontal and vertical motion vectors (for inter-coded macroblocks).
Block layer   A block consists of run-level coded coefficients corresponding to an 8 x 8
block of samples.
The core CODEC (based on H.263) was designed for efficient coding at low bit rates. The
use of 8 x 8 block motion compensation and the design of the variable-length coding tables
make the VLBV MPEG-4 CODEC more efficient than MPEG-1 or MPEG-2 (see Chapter 5
for a comparison of coding efficiency).
Figure 4.17 illustrates the concept of opaque and semi-transparent VOPs: in image (a), VOP2
(foreground) is opaque and completely obscures VOP1 (background), whereas in image (b)
VOP2 is partly transparent.
Binary shape information is coded in 16 x 16 blocks (binary alpha blocks, BABs). There
are three possibilities for each block:

1. All pixels are transparent, i.e. the block is entirely outside the VOP, and no further
information is coded for the block.

2. All pixels are opaque, i.e. the block is fully inside the VOP. No shape information is
coded: the pixel values of the block (texture) are coded as described in the next section.
3. Some pixels are opaque and some are transparent, i.e. the block crosses a boundary of the
VOP. The binary shape values of each pixel (1 or 0) are coded using a form of DPCM and
the texture information of the opaque pixels is coded as described below.
Grey scale shape information produces values in the range 0 (transparent) to 255 (opaque)
that are compressed using block-based DCT and motion compensation.
Motion compensation   Similar options exist to the I-, P- and B-pictures in MPEG-1 and
MPEG-2:

1. I-VOP: the VOP is encoded without any motion compensation.

2. P-VOP: the VOP is encoded using motion-compensated prediction from a previously
coded VOP.

3. B-VOP: the VOP is encoded using motion-compensated prediction from previous and
future coded VOPs.
Figure 4.18 shows mode (3), prediction of a B-VOP from a previous I-VOP and future
P-VOP. For macroblocks (or 8 x 8 blocks) that are fully contained within the current and
reference VOPs, block-based motion compensation is used in a similar way to MPEG-1 and
MPEG-2. The motion compensation process is modified for blocks or macroblocks along the
boundary of the VOP. In the reference VOP, pixels in the 16 x 16 (or 8 x 8) search area
are padded based on the pixels along the edge of the VOP. The macroblock (or block) in the
current VOP is matched with this search area using block matching: however, the difference
value (mean absolute error or sum of absolute errors) is only computed for those pixel
positions that lie within the VOP.
Texture coding Pixels (or motion-compensated residual values) within a VOP are coded
as texture. The basic tools are similar to MPEG-1 and MPEG-2: transform using the DCT,
quantisation of the DCT coefficients followed by reordering and variable-length coding. To
further improve compression efficiency, quantised DCT coefficients may be predicted from
previously transmitted blocks (similar to the differential prediction of DC coefficients used
in JPEG, MPEG-1 and MPEG-2).
A macroblock that covers a boundary of the VOP will contain both opaque and transparent
pixels. In order to apply a regular 8 x 8 DCT, it is necessary to use padding to fill up the
transparent pixel positions. In an inter-coded VOP, where the texture information is motion-
compensated residual data, the transparent positions are simply filled with zeros. In an intra-
coded VOP, where the texture is original pixel data, the transparent positions are filled by
extrapolating the pixel values along the boundary of the VOP.
Error resilience   MPEG-4 incorporates a number of mechanisms that can provide
improved performance in the presence of transmission errors (such as bit errors or lost
packets). The main tools are:
1. Synchronisation markers: similar to MPEG-1 and MPEG-2 slice start codes, except that
these may optionally be positioned so that each resynchronisation interval contains an
approximately equal number of encoded bits (rather than a constant number of macroblocks). This means that errors are likely to be evenly distributed among the resynchronisation intervals. Each resynchronisation interval may be transmitted in a separate video
packet.
2. Data partitioning: similar to the data partitioning mode of MPEG-2.
3. Header extension: redundant copies of header information are inserted at intervals in the
bit stream so that if an important header (e.g. a picture header) is lost due to an error, the
redundant header may be used to partially recover the coded scene.

4. Reversible VLCs: these variable-length codes limit the propagation (spread) of an
errored region in a decoded frame or VOP and are described further in Chapter 8.
Scalability MPEG-4 supports spatial and temporal scalability. Spatial scalability applies to
rectangular VOPs in a similar way to MPEG-2: the base layer gives a low spatial resolution
and an enhancement layer may be decoded together with the base layer to give a higher
resolution. Temporal scalability is extended beyond the MPEG-2 approach in that it may be
applied to individual VOPs. For example, a background VOP may be encoded without
scalability, whilst a foreground VOP may be encoded with several layers of temporal
scalability. This introduces the possibility of decoding a foreground object at a higher frame
rate and more static, background objects at a lower frame rate.
Sprite coding A sprite is a VOP that is present for the entire duration of a video sequence
(VS). A sprite may be encoded and transmitted once at the start of the sequence, giving a
potentially large benefit in compression performance. A good example is a background
sprite: the background image to a scene is encoded as a sprite at the start of the VS. For the
remainder of the VS, only the foreground VOPs need to be coded and transmitted since the
decoder can render the background from the original sprite. If there is camera movement
(e.g. panning), then a sprite that is larger than the visible scene is required (Figure 4.19). In
order to compensate for more complex camera movements
(e.g. zoom or rotation), it may be
necessary for the decoder to warp the sprite. A sprite is encoded as an I-VOP as described
earlier.
Static texture  An alternative set of tools to the DCT may be used to code static texture,
i.e. texture data that does not change rapidly. The main application for this is to code texture

techniques including:
• 3-D mesh coding, where an object is described as a mesh in 3-D space. This is more
complex than a 2-D mesh representation but gives a higher degree of flexibility in terms
of representing objects within a scene.

• Face and body model coding, where a human face or body is rendered at the decoder
according to a face or body model. The model is controlled (moved) by changing
animation parameters. In this way a head-and-shoulders video scene may be coded by
sending only the animation parameters required to move the model at the decoder. Static
texture is mapped onto the model surface.
These three tools offer the potential for fundamental improvements in video coding
performance and flexibility: however, their application is currently limited because of the
high processing resources required to analyse and render even a very simple scene.
MPEG-4 standard. Each profile is defined in terms of one or more object types, where an
object type is a subset of the MPEG-4 tools. Table 4.3 lists the main MPEG-4 object types
that make up the profiles. The Simple object type contains tools for coding of basic I- and
P-rectangular VOPs (complete frames) together with error resilience tools and the short
header option (for compatibility with H.263). The Core type adds B-VOPs and basic shape
coding (using a binary shape mask only). The main profile adds grey scale shape coding and
sprite coding.
MPEG-4 (Visual) is gaining popularity in a number of application areas such as Internet-based
video. However, to date the majority of applications use only the Simple object type
and there has been limited take-up of the content-based features of the standard. This is
partly because of technical complexities (for example, it is difficult to accurately segment a
video scene into foreground and background objects, e.g. Figure 4.14, using an automatic
algorithm) and partly because useful applications for content-based video coding and
manipulation have yet to emerge. At the time of writing, the great majority of video coding
applications continue to work with complete rectangular frames. However, researchers
continue to improve algorithms for segmenting and manipulating video. The
content-based tools have a number of interesting possibilities: for example, they make it
Table 4.3  MPEG-4 video object types

Object types: Simple; Core; Main; Simple scalable; Animated 2-D mesh; Basic animated texture; Still scalable texture; Simple face.

Visual tools: Basic (I-VOP, P-VOP, coefficient prediction, 16 x 16 and 8 x 8 motion vectors); Error resilience; Short header; B-VOP; P-VOP with overlapped block matching; Alternative quantisation; P-VOP based temporal scalability; Binary shape; Grey shape; Interlaced video coding; Sprite; Rectangular temporal scalability; Rectangular spatial scalability; Scalable still texture; 2-D mesh; Facial animation parameters.
4.5 SUMMARY
The ISO has issued a number of image and video coding standards that have heavily
influenced the development of the technology and market for video coding applications. The
original JPEG still image compression standard is now a ubiquitous method for storing and
transmitting still images and has gained some popularity as a simple and robust algorithm for
video compression. The improved subjective and objective performance of its successor,
JPEG-2000, may lead to the gradual replacement of the original JPEG algorithm.
The first MPEG standard, MPEG-1, was never a market success in its target application
(video CDs) but is widely used for PC and Internet video applications and formed the basis
for the MPEG-2 standard. MPEG-2 has enabled a worldwide shift towards digital television
and is probably the most successful of the video coding standards in terms of market
penetration. The MPEG-4 standard offers a plethora of video coding tools which may in time
enable many new applications: however, at the present time the most popular element of
MPEG-4 (Visual) is the core low bit rate CODEC that is based on the ITU-T H.263
standard. In the next chapter we will examine the H.26x series of coding standards, H.261,
H.263 and the emerging H.26L.
REFERENCES
1. http://www.itu.int/ [International Telecommunication Union].
2. http://www.iso.ch/ [International Standards Organisation].
3. ISO/IEC 10918-1 / ITU-T Recommendation T.81, Digital compression and coding of continuous-tone still images, 1992 [JPEG].
4. ISO/IEC 11172-2, Information technology - coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - part 2: Video, 1993 [MPEG-1 Video].
5. ISO/IEC 13818-2, Information technology: generic coding of moving pictures and associated audio information: Video, 1995 [MPEG-2 Video].
6. ISO/IEC 14496-2, Information technology - coding of audio-visual objects - part 2: Visual, 1998 [MPEG-4 Visual].
7. ISO/IEC FCD 15444-1, JPEG-2000 Final Committee Draft v1.0, March 2000.
8. ISO/IEC JTC1/SC29/WG11 N4031, Overview of the MPEG-7 Standard, Singapore, March 2001.
9. ISO/IEC JTC1/SC29/WG11 N4318, MPEG-21 Overview, Sydney, July 2001.
10. http://standards.pictel.com/ftp/video-site/ [VCEG working documents].
11. http://www.cselt.it/mpeg/ [MPEG committee official site].
12. http://www.jpeg.org/ [JPEG resources].
13. http://www.mpeg.org/ [MPEG resources].
14. ITU-T Q6/SG16 Draft Document, Appendix III for ITU-T Rec. H.263, Porto Seguro, May 2001.
15. A. N. Skodras, C. A. Christopoulos and T. Ebrahimi, JPEG2000: the upcoming still image compression standard, Proc. 11th Portuguese Conference on Pattern Recognition, Porto, 2000.
16. ISO/IEC 11172-1, Information technology - coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - part 1: Systems, 1993 [MPEG-1 Systems].
17. ISO/IEC 11172-3, Information technology - coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - part 3: Audio, 1993 [MPEG-1 Audio].
18. ISO/IEC 13818-3, Information technology: generic coding of moving pictures and associated audio information: Audio, 1995 [MPEG-2 Audio].
19. ISO/IEC 13818-1, Information technology: generic coding of moving pictures and associated audio information: Systems, 1995 [MPEG-2 Systems].
20. P. Salembier and F. Marqués, Region-based representations of image and video: segmentation tools for multimedia services, IEEE Trans. CSVT, 9(8), December 1999.
21. L. Garrido, A. Oliveras and P. Salembier, Motion analysis of image sequences using connected operators, Proc. VCIP97, San Jose, February 1997, SPIE 3024.
22. K. Illgner and F. Muller, Image segmentation using motion estimation, in Time-Varying Image Processing and Image Recognition, Elsevier Science, 1997.
23. R. Castagno and T. Ebrahimi, Video segmentation based on multiple features for interactive multimedia applications, IEEE Trans. CSVT, 8(5), September 1998.
24. E. Steinbach, P. Eisert and B. Girod, Motion-based analysis and segmentation of image sequences using 3-D scene models, Signal Processing, 66(2), April 1998.
25. M. Chang, M. Tekalp and M. Ibrahim Sezan, Simultaneous motion estimation and segmentation, IEEE Trans. Im. Proc., 6(9), 1997.
5
Video Coding Standards:
H.261, H.263 and H.26L
5.1 INTRODUCTION
The ISO MPEG video coding standards are aimed at storage and distribution of video for
entertainment and have tried to meet the needs of providers and consumers in the media
industries. The ITU has (historically) been more concerned with the telecommunications
industry, and its video coding standards (H.261, H.263, H.26L) have consequently been
targeted at real-time, point-to-point or multi-point communications.
The first ITU-T video coding standard to have a significant impact, H.261, was developed
during the late 1980s/early 1990s with a particular application and transmission channel in
mind. The application was video conferencing (two-way communications via a video link)
and the channel was N-ISDN. ISDN provides a constant bit rate of p x 64 kbps, where p is an
integer in the range 1-30: it was felt at the time that ISDN would be the medium of choice
for video communications because of its guaranteed bandwidth and low delay. Modem
channels over the analogue POTS/PSTN (at speeds of less than 9600 bps at the time) were
considered to be too slow for visual communications and packet-based transmission was not
considered to be reliable enough.
H.261 was quite successful and continues to be used in many legacy video conferencing
applications. Improvements in processor performance, video coding techniques and the
emergence of analogue modems and Internet Protocol (IP) networks as viable channels led
to the development of its successor, H.263, in the mid-1990s. By making a number of
improvements to H.261, H.263 provided significantly better compression performance as
well as greater flexibility. The original H.263 standard (Version 1) had four optional modes
which could be switched on to improve performance (at the expense of greater complexity).
These modes were considered to be useful and Version 2 (H.263+) added 12 further
optional modes. The latest (and probably the last) version (v3) will contain a total of 19
modes, each offering improved coding performance, error resilience and/or flexibility.
Version 3 of H.263 has become a rather unwieldy standard because of the large number of
options and the need to continue to support the basic (baseline) CODEC functions. The
latest initiative of the ITU-T experts group VCEG is the H.26L standard (where 'L' stands
for 'long term'). This is a new standard that makes use of some of the best features of H.263
and aims to improve compression performance by around 50% at lower bit rates. Early
indications are that H.26L will outperform H.263+ (but possibly not by 50%).
5.2 H.261
Typical operating bit rates for H.261 applications are between 64 and 384 kbps. At the time
of development, packet-based transmission over the Internet was not expected to be a
significant requirement, and the limited video compression performance achievable at the
time was not considered to be sufficient to support bit rates below 64 kbps.
A typical H.261 CODEC is very similar to the generic motion-compensated DCT-based
CODEC described in Chapter 3. Video data is processed in 4:2:0 Y:Cr:Cb format. The
basic unit is the macroblock, containing four luminance blocks and two chrominance
blocks (each 8 x 8 samples) (see Figure 4.6). At the input to the encoder, 16 x 16 macroblocks
may be (optionally) motion compensated using integer motion vectors. The motion-compensated
residual data is coded with an 8 x 8 DCT followed by quantisation and zigzag
reordering. The reordered transform coefficients are run-level coded and compressed with
an entropy encoder (see Chapter 8).
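The sketch below illustrates the zigzag reordering and run-level coding steps of this pipeline for a quantised 8 x 8 coefficient block. It shows the general technique only: the entropy coding of the (run, level) pairs and the exact end-of-block signalling are standard-specific and are omitted, and the function names are illustrative.

import numpy as np

def zigzag_indices(n=8):
    """Generate (row, col) pairs in zigzag scanning order for an n x n block."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_level(block):
    """Reorder a quantised block in zigzag order and produce (run, level) pairs."""
    scanned = [block[r, c] for r, c in zigzag_indices(block.shape[0])]
    pairs, run = [], 0
    for coeff in scanned:
        if coeff == 0:
            run += 1                 # count zeros preceding the next non-zero level
        else:
            pairs.append((run, int(coeff)))
            run = 0
    return pairs                     # trailing zeros are signalled by an end-of-block code

# Example: a typical quantised block with energy packed into the top-left corner.
blk = np.zeros((8, 8), dtype=int)
blk[0, 0], blk[0, 1], blk[1, 0], blk[1, 1] = 26, 3, -2, 1
print(run_level(blk))   # [(0, 26), (0, 3), (0, -2), (1, 1)]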
Motion compensation performance is improved by use of an optional loop filter, a 2-D
spatial filter that operates on each 8 x 8 block in a macroblock prior to motion compensation
(if the filter is switched on). The filter has the effect of smoothing the reference picture,
which can help to provide a better prediction reference. Chapter 9 discusses loop filters in
more detail (see for example Figures 9.11 and 9.12).
In addition, a forward error correcting code is defined in the standard that should be
inserted into the transmitted bit stream. In practice, this code is often omitted from practical
implementations of H.261: the error rate of an ISDN channel is low enough that error
correction is not normally required, and the code specified in the standard is not suitable for
other channels (such as a noisy wireless channel or packet-based transmission).
Each macroblock may be coded in intra mode (no motion-compensated prediction) or
inter mode (with motion-compensated prediction). Only two frame sizes are supported,
CIF (352 x 288 pixels) and QCIF (176 x 144 pixels).
H.261 was developed at a time when hardware and software processing performance was
limited and therefore has the advantage of low complexity. However, its disadvantages
include poor compression performance (with poor video quality at bit rates of under about
100 kbps) and lack of flexibility. It has been superseded by H.263, which has higher
compression efficiency and greater flexibility, but is still widely used in installed video
conferencing systems.
5.3 H.263²
In developing the H.263 standard, VCEG aimed to improve upon H.261 in a number of areas.
By taking advantage of developments in video coding algorithms and improvements in
processing performance, it provides better compression. H.263 provides greater flexibility than
H.261: for example, a wider range of frame sizes is supported (listed in Table 4.2). The first
version of H.263 introduced four optional modes, each described in an annex to the standard, and
further optional modes were introduced in Version 2 of the standard (H.263+). The target
application of H.263 is low-bit-rate, low-delay two-way video communications. H.263 can
support video communications at bit rates below 20 kbps (at a very limited visual quality)
and is now widely used both in established applications such as video telephony and video
conferencing and in an increasing number of new applications (such as Internet-based video).
5.3.1 Features
The baseline H.263 CODEC is functionally identical to the MPEG-4 'short header' CODEC
described in Section 4.4.3. Input frames in 4:2:0 format are motion compensated (with half-pixel
resolution motion vectors), transformed with an 8 x 8 DCT, quantised, reordered and entropy
coded. The main factors that contribute to the improved coding performance over H.261 are the use
of half-pixel motion vectors (providing better motion compensation) and redesigned
variable-length code (VLC) tables (described further in Chapter 8). Features such as I- and P-pictures,
more frame sizes and optional coding modes give the designer greater flexibility to deal with
different application requirements and transmission scenarios.
Annex D, Unrestricted motion vectors  The optional mode described in Annex D of
H.263 allows motion vectors to point outside the boundaries of the picture. This can provide
a coding performance gain, particularly if objects are moving into or out of the picture. The
pixels at the edges of the picture are extrapolated to form a border outside the picture that
vectors may point to (Figure 5.1).

Figure 5.1  Unrestricted motion vectors

In addition, the motion vector range is extended so that longer vectors are allowed. Finally,
Annex D contains an optional alternative set of VLCs for encoding motion vector data.
These VLCs are reversible, making it easier to recover from
transmission errors (see Chapter 11).
Annex E, Syntax-based arithmetic coding  Arithmetic coding is used instead of variable-length
coding. Each of the VLCs defined in the standard is replaced with a probability value
that is used by an arithmetic coder (see Chapter 8).
Annex F, Advanced prediction  The efficiency of motion estimation and compensation is
improved by allowing the use of four vectors per macroblock (a separate motion vector for
each 8 x 8 luminance block, Figure 5.2). Overlapped block motion compensation (described
in Chapter 6) is used to improve motion compensation and reduce 'blockiness' in the
decoded image. Annex F requires the CODEC to support unrestricted motion vectors
(Annex D).
Annex G, PB-frames  A PB-frame is a pair of frames coded as a combined unit. The first
frame is coded as a B-picture and the second as a P-picture. The P-picture is forward
predicted from the previous I- or P-picture and the B-picture is bidirectionally predicted
from the previous and current I- or P-pictures. Unlike MPEG-1 (where a B-picture is coded
as a separate unit), each macroblock of the PB-frame contains data from both the P-picture
and the B-picture (Figure 5.3). PB-frames can give an improvement in compression
efficiency.
Annex I, Advanced intra-coding  This mode exploits the correlation between DCT
coefficients in neighbouring intra-coded blocks in an image. The DC coefficient and the
first row or column of AC coefficients may be predicted from the coefficients of
neighbouring blocks (Figure 5.4). The zigzag scan, quantisation procedure and variable-length
code tables are modified and the result is an improvement in compression efficiency
for intra-coded macroblocks.
Annex J, Deblocking filter  The edges of each 8 x 8 block are smoothed using a spatial
filter (described in Chapter 9). This reduces 'blockiness' in the decoded picture and also
improves motion compensation performance. When the deblocking filter is switched on, four
motion vectors per macroblock and unrestricted motion vectors (as in Annexes F and D) may
also be used.
Annex L, Supplemental enhancement information  This annex contains a number of
supplementary codes that may be sent by an encoder to a decoder. These codes indicate
display-related information about the video sequence, such as picture freeze and timing
information.
Annex M, Improved PB-frames As the name suggests, this is an improved version of the
original PB-frames mode (Annex G). Annex M adds the options of forward or backward
prediction for the B-frame part of each macroblock (as well as the bidirectional prediction
defined in Annex G), resulting in improved compression efficiency.
Annex N, Reference picture selection  This mode enables an encoder to choose from a
number of previously coded pictures for predicting the current picture. The use of this mode
to limit error propagation in a noisy transmission environment is discussed in Chapter 11. At
the start of each GOB or slice, the encoder may choose the preferred reference picture for
prediction of macroblocks in that GOB or slice.
Annex O, Scalability  Temporal, spatial and SNR scalability are supported by this optional
mode. In a similar way to the MPEG-2 optional scalability modes, spatial scalability
increases frame resolution, SNR scalability increases picture quality and temporal scalability
increases frame rate. In each case, a base layer provides basic performance and the
increased performance is obtained by decoding the base layer together with an enhancement
layer. Temporal scalability is particularly useful because it supports B-pictures: these are
similar to the 'true' B-pictures in the MPEG standards (where a B-picture is a separate coded
unit) and are more flexible than the combined PB-frames described in Annexes G and M.
Annex P, Reference picture resampling  The prediction reference frame used by the
encoder and decoder may be resampled prior to motion compensation. This has several
possible applications. For example, an encoder can change the frame resolution 'on the fly'
whilst continuing to use motion-compensated prediction. The prediction reference frame is
resampled to match the new resolution and the current frame can then be predicted from the
resampled reference. This mode may also be used to support 'warping', i.e. the reference
picture is warped (deformed) prior to prediction, perhaps to compensate for nonlinear
camera movements such as zoom or rotation.
Annex Q, Reduced resolution update  An encoder may choose to update selected
macroblocks at a lower resolution than the normal spatial resolution of the frame. This
may be useful, for example, to enable a CODEC to refresh moving parts of a frame at a low
resolution using a small number of coded bits whilst keeping the static parts of the frame at
the original higher resolution.
Annex R, Independent segment decoding  This annex extends the concept of the
independently decodeable slices (Annex K) or GOBs. Segments of the picture (where a segment
is one slice or an integral number of GOBs) may be decoded completely independently of
any other segment. In the slice structured mode (Annex K), motion vectors can point to areas
of the reference picture that are outside the current slice; with independent segment
decoding, motion vectors and other predictions can only reference areas within the current
segment in the reference picture (Figure 5.6). A segment can be decoded (over a series of
frames) independently of the rest of the frame.

Figure 5.6  Independent segments
Annex S, Alternative inter-VLC The encoder may use an alternative variable-length code
table for transform coefficients in inter-coded blocks. The alternative VLCs
(actually the
same VLCs used for intra-coded blocks in Annex I) can provide better coding efficiency
when there are a large number of high-valued quantised DCT coefficients (e.g. if the coded
bit rate is high and/or there is a lot of variation in the video scene).
Annex T, Modified quantisation  This mode introduces some changes to the way the
quantiser and rescaling operations are carried out. Annex T allows the encoder to change the
quantiser scale factor in a more flexible way during encoding, making it possible to control
the encoder output bit rate more accurately.
Annex U, Enhanced reference picture selection Annex U modifies the reference picture
selection mode of Annex N to provide improved error resilience and coding efficiency. There
are a number of changes, including a mechanism to reduce the memory requirements for
storing previously coded pictures and the ability to select a reference picture for motion
compensation on a macroblock-by-macroblock basis. This means that the best match for
each macroblock may be selected from any of a number of stored previous pictures (also
known as long-term memory prediction).
Annex V, Data partitioned slice  Modified from Annex K, this mode improves the
resilience of slice structured data to transmission errors. Within each slice, the macroblock
data is rearranged so that all of the macroblock headers are transmitted first, followed by all
of the motion vectors and finally by all of the transform coefficient data. An error occurring
in header or motion vector data usually has a more serious effect on the decoded picture than
an error in transform coefficient data: by rearranging the data in this way, an error occurring
part-way through a slice should only affect the less-sensitive transform coefficient data.
Annex W, Additional supplemental enhancement information
Two extra enhancement
information items are defined (in addition to those defined in Annex L). The fixed-point
IDCT function indicates that an approximate inverse DCT (IDCT) may be used rather than
the exact definition of the IDCT given in the standard: this can be useful for low-complexity
fixed-point implementations of the standard. The picture message function allows the
insertion of a user-definable message into the coded bit stream.
5.4.1
H.263 Profiles
It is very unlikely that all 19 optional modes will be required for any one application.
Instead, certain combinations of modes may be useful for particular transmission scenarios.
In common with MPEG-2 and MPEG-4, H.263 defines a set of recommended profiles (where
a profile is a subset of the optional tools) and levels (where a level sets a maximum value on
certain coding parameters such as frame resolution, frame rate and bit rate). Profiles and
levels are defined in the final annex of H.263, Annex X. There are a total of nine profiles, as
follows.
Profile 0, Baseline This is simply the baseline H.263 functionality, without any optional
modes.
Profile 1, Coding efficiency (Version 2)  This profile provides efficient coding using only
tools available in Versions 1 and 2 of the standard (i.e. up to Annex T). The selected optional
modes are Annex I (Advanced intra-coding), Annex J (Deblocking filter), Annex L
(Supplemental enhancement information: only the full picture freeze function is supported) and
Annex T (Modified quantisation). Annexes I, J and T provide improved coding efficiency
compared with the baseline mode. Annex J incorporates the best features of the first
version of the standard, four motion vectors per macroblock and unrestricted motion vectors.
Profile 2, Coding efficiency (Version 1) Only tools available in Version 1 of the standard
are used in this profile and in fact only Annex F (Advanced Prediction) is included. The
other three annexes (D, E, G) from the original standard are not (with hindsight) considered
to offer sufficient coding gains to warrant their use.
Profiles 3 and 4, Interactive and streaming wireless These profiles incorporate efficient
coding tools (Annexes I, J and T) together with the slice structured mode (Annex K) and, in
the case of Profile 4, the data partitioned slice mode (Annex V). These slice modes can
support increased error resilience which is important for noisy wireless transmission
environments.
Profiles 5, 6 and 7, Conversational  These three profiles support low-delay, high-compression
conversational applications (such as video telephony). Profile 5 includes tools that provide
efficient coding; Profile 6 adds the slice structured mode (Annex K) for Internet conferencing;
Profile 7 adds support for interlaced camera sources (part of Annex W).
Profile 8, High latency  For applications that can tolerate a higher latency (delay), such as
streaming video, Profile 8 adds further efficient coding tools such as B-pictures (Annex O)
and reference picture resampling (Annex P). B-pictures increase coding efficiency at the
expense of a greater delay.
The remaining tools within the 19 annexes are not included in any profile, either because
they are considered to be too complex for anything otherthan special-purpose applications,
or because more efficient tools have superseded them.
5.5
H.26L³
The 19 optional modes of H.263 improved coding efficiency and transmission capabilities:
however, development of the H.263 standard is constrained by the requirement to continue to
support the original baseline syntax. The latest standardisation effort by the Video Coding
Experts Group is to develop a new coding syntax that offers significant benefits over the
older H.261 and H.263 standards. This new standard is currently described as 'H.26L', where
the 'L' stands for 'long term' and refers to the fact that this standard was planned as a
long-term solution beyond the 'near-term' additions to H.263 (Versions 2 and 3).
The aim of H.26L is to provide a 'next generation' solution for video coding applications,
offering significantly improved coding efficiency whilst reducing the 'clutter' of the many
optional modes in H.263. The new standard also aims to take account of the changing
nature of video coding applications. Early applications of H.261 used dedicated CODEC
hardware over the low-delay, low-error-rate ISDN. The recent trend is towards software-only
or mixed software/hardware CODECs (where computational resources are limited, but
greater flexibility is possible than with a dedicated hardware CODEC) and more challenging
transmission scenarios (such as wireless links with high error rates and packet-based
transmission over the Internet).
H.26L is currently at the test model development stage
and may continue to evolve before
standardisation. The main features can be summarised as follows.
Figure 5.7  4 x 4 coefficient blocks and 2 x 2 DC coefficients
4 x 4 Block transform  After motion compensation, the residual data within each block is
transformed using a 4 x 4 block transform. This is based on a 4 x 4 DCT but is an integer
transform (rather than the floating-point 'true' DCT). An integer transform avoids problems
caused by mismatches between different implementations of the DCT and is well suited to
implementation in fixed-point arithmetic units (such as low-power embedded processors,
Chapter 13).
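As an illustration of the principle, the sketch below applies a 4 x 4 integer transform to a block of residual samples using only integer arithmetic. The transform matrix shown is the integer DCT approximation adopted by the later H.264 standard, used here purely as an example: the H.26L test models defined their own (similar) integer matrices, and the sample values are hypothetical.

import numpy as np

# Illustrative 4 x 4 integer transform matrix (the H.264-style approximation of
# the DCT); shown only to demonstrate the idea of an integer block transform.
C = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]], dtype=np.int64)

def forward_transform(block):
    """2-D integer transform: every operation is exact integer arithmetic, so the
    encoder and decoder cannot drift apart through rounding differences."""
    return C @ block @ C.T

# A 4 x 4 block of residual samples (hypothetical values).
residual = np.array([[ 5,  3,  1,  0],
                     [ 4,  2,  0, -1],
                     [ 3,  1, -1, -2],
                     [ 1,  0, -2, -3]], dtype=np.int64)

coeffs = forward_transform(residual)
print(coeffs)   # for a smooth block, the low-frequency (top-left) coefficients dominate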
Universal variable-length code  The VLC tables in H.263 are replaced with a single
'universal' VLC. A transmitted code is created by building up a 'regular' VLC from the universal
codeword. These codes have two advantages: they can be implemented efficiently in
software without the need for storage of large tables and they are reversible, making it
easier to recover from transmission errors (see Chapters 8 and 11 for further discussion of
VLCs and error resilience).
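The sketch below shows how a universal code of this kind can be generated algorithmically, using an exponential-Golomb-style construction (a prefix of zeros followed by the binary value). It illustrates the idea of building every codeword from one regular structure rather than from stored tables; the exact H.26L codeword structure is not reproduced here.

def uvlc_encode(k):
    """Encode a non-negative integer as a universal code: a prefix of zeros whose
    length grows with log2(k), followed by the binary representation of k + 1.
    Codewords are generated on the fly, so no code tables are needed."""
    value = k + 1
    prefix_zeros = value.bit_length() - 1
    return '0' * prefix_zeros + format(value, 'b')

def uvlc_decode(bits):
    """Decode one codeword from the front of a bit string; return (value, remainder)."""
    zeros = 0
    while bits[zeros] == '0':
        zeros += 1
    value = int(bits[zeros:2 * zeros + 1], 2)
    return value - 1, bits[2 * zeros + 1:]

# Small values get short codes, which suits typically small coded symbols.
for k in range(6):
    print(k, uvlc_encode(k))   # 0 -> '1', 1 -> '010', 2 -> '011', 3 -> '00100', ...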
Context-based adaptive binary arithmetic coding  This alternative entropy encoder uses
arithmetic coding (described in Chapter 8) to give higher compression efficiency than
variable-length coding. In addition, the encoder can adapt to local image statistics, i.e. it can generate
and use accurate probability statistics rather than using predefined probability tables.
B-pictures  These are recognised to be a very useful coding tool, particularly for
applications that are not very sensitive to transmission delays. H.26L supports B-pictures in a similar
way to MPEG-1 and MPEG-2, i.e. there is no restriction on the number of B-pictures that
may be transmitted between pairs of I- and/or P-pictures.
At the time of writing it remains to be seen whether H.26L will supersede the popular
H.261 and H.263 standards. Early indications are that it offers a reasonably impressive
performance gain over H.263 (see the next section): whether these gains are sufficient to
merit a switch to the new standard is not yet clear.
Figure: coding performance comparison of MJPEG, MPEG-2, MPEG-4, H.263 and H.26L
Coding standard    Target application      Coding performance
MJPEG              Image coding            1 (worst)
H.261              Video conferencing      2
MPEG-1             Video-CD                3 (equal)
MPEG-2             Digital TV              3 (equal)
H.263              Video conferencing      4 (equal)
MPEG-4             Multimedia coding       4 (equal)
H.26L              Video conferencing      5 (best)
5.7
SUMMARY
The ITU-T Video Coding Experts Group developed the H.261 standard for video
conferencing applications which offered reasonable compression performance with relatively low
complexity. This was superseded by the popular H.263 standard, offering better performance
through features such as half-pixel motion compensation and improved variable-length
coding. Two further versions of H.263 have been released, each offering additional optional
coding modes to support better compression efficiency and greater flexibility. The latest
version (Version 3) includes 19 optional modes, but is constrained by the requirement to
support the original, baseline H.263 CODEC. The H.26L standard, under development
at the time of writing, incorporates a number of new coding tools such as a 4 x 4
block transform and flexible motion vector options and promises to outperform earlier
standards.
Comparing the performance of the various coding standards is difficult because a direct
rate-distortion comparison does not take into account other factors such as features,
flexibility and market penetration. It seems clear that the H.263, MPEG-2 and MPEG-4
standards each have their advantages for designers of video communication systems. Each of
these standards makes use of common coding technologies: motion estimation and
compensation, block transformation and entropy coding. In the next section of this book
we will examine these core technologies in detail.
REFERENCES
1. ITU-T Recommendation H.261, Video CODEC for audiovisual services at p x 64 kbit/s, 1993.
2. ITU-T Recommendation H.263, Video coding for low bit rate communication, Version 2, 1998.
3. ITU-T Q6/SG16 VCEG-L45, H.26L Test Model Long Term Number 6 (TML-6) draft 0, March 2001.
4. ITU-T Q6/SG16 VCEG-M08, Objective coding performance of [H.26L] TML 5.9 and H.263+, March 2001.
6
Motion Estimation and Compensation
6.1 INTRODUCTION
In the video coding standards described in Chapters 4 and 5, blocks of image samples or
residual data are compressed using a block-based transform (such as the DCT) followed by
quantisation and entropy encoding. There is limited scope for improved compression
performance in the later stages of encoding (DCT, quantisation and entropy coding), since
the operation of the DCT and the codebook for entropy coding are specified by the relevant
video coding standard. However, there is scope for significant performance improvement in
the design of the first stage of a video CODEC (motion estimation and compensation).
Efficient motion estimation reduces the energy in the motion-compensated residual frame
and can dramatically improve compression performance. Motion estimation can be very
computationally intensive and so this compression performance may be at the expense of
high computational complexity. This chapter describes the motion estimation and
compensation process in detail and discusses implementation alternatives and trade-offs.
The motion estimation and compensation functions have many implications for CODEC
performance. Key performance issues include:

• Coding performance (how efficient is the algorithm at minimising the residual frame?)
• Complexity (does the algorithm make effective use of computation resources, how easy is
it to implement in software or hardware?)
• Storage and/or delay (does the algorithm introduce extra delay and/or require storage of
multiple frames?)
• Side information (how much extra information must be transmitted to the decoder?)
• Error resilience (how does the decoder perform when errors occur during transmission?)
These issues are interrelated and are potentially contradictory (e.g. better coding performance
may lead to increased complexity and delay and poor error resilience) and different
solutions are appropriate for different platforms and applications. The design and
implementation of motion estimation, compensation and reconstruction can be critical to the
performance of a video coding application.
Figure 6.1  Motion estimation and compensation block diagram

Figure: current block, reference area and search positions (x, y) in the reference frame
For example, the mean squared error (MSE) between the current block and the same position
in the reference frame (position (0, 0)) is the mean of the nine squared sample differences:

MSE = {(1 - 4)^2 + (3 - 2)^2 + ... + (3 - 3)^2} / 9 = 2.44
The complete set of MSE values for each search position is listed in Table 6.1 and shown
graphically in Figure 6.4. Of the nine candidate positions, (-1, 1) gives the smallest MSE
and hence the best match. In this example, the best 'model' for the current block (i.e. the
best prediction) is the 3 x 3 region in position (-1, 1).
A video encoder carries out this process for each block in the current frame:

1. Calculate the energy of the difference between the current block and a set of neighbouring
regions in the reference frame.
2. Select the region that gives the lowest difference energy (the best match).
3. Subtract the selected region from the current block to form a difference block.
4. Encode and transmit the difference block.
5. Encode and transmit a motion vector that indicates the position of the matching region,
relative to the current block position (in the above example, the motion vector is (-1, 1)).
Steps 1 and 2 above correspond to motion estimation and step 3 to motion compensation.
The video decoder reconstructs the block as follows:
1. Decode the difference block and motion vector.
2. Add the difference block to the matching region in the reference frame (i.e. the region
pointed to by the motion vector).
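A minimal sketch of the motion compensation and reconstruction steps (forming the difference block at the encoder and adding it back at the decoder) is shown below. Residual encoding and decoding are omitted, the motion vector sign convention is illustrative and the example values are hypothetical.

import numpy as np

def motion_compensate(current_block, reference, block_pos, motion_vector):
    """Encoder: subtract the matching region (pointed to by the motion vector)
    from the current block to form the difference block."""
    (row, col), (dx, dy) = block_pos, motion_vector
    n = current_block.shape[0]
    matching = reference[row + dy: row + dy + n, col + dx: col + dx + n]
    return current_block - matching

def reconstruct(difference_block, reference, block_pos, motion_vector):
    """Decoder: add the decoded difference block to the matching region."""
    (row, col), (dx, dy) = block_pos, motion_vector
    n = difference_block.shape[0]
    matching = reference[row + dy: row + dy + n, col + dx: col + dx + n]
    return difference_block + matching

# Hypothetical 3 x 3 example: the motion vector (-1, 1) points one sample to the
# left and one sample down in the reference frame, as in the worked example above.
rng = np.random.default_rng(0)
reference = rng.integers(0, 8, size=(8, 8))
current_block = reference[4:7, 2:5].copy()
diff = motion_compensate(current_block, reference, (3, 3), (-1, 1))
assert np.array_equal(reconstruct(diff, reference, (3, 3), (-1, 1)), current_block)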
Table 6.1  MSE values for the nine candidate search positions

Position (x, y)    MSE
(-1, -1)           4.67
(0, -1)            2.78
(1, -1)            2.89
(-1, 0)            3.22
(0, 0)             2.44
(1, 0)             3.33
(-1, 1)            0.22
(0, 1)             5.33
(1, 1)             2.56
Figure 6.4  MSE map: (a) surface plot; (b) pseudocolour plot
6.2.3 Minimising Difference Energy

The name 'motion estimation' is misleading because the process does not necessarily
identify 'true' motion; instead it attempts to find a matching region in the reference frame
that minimises the energy of the difference block. Where there is clearly identifiable
linear motion, such as large moving objects or global motion (camera panning, etc.),
motion vectors produced in this way should roughly correspond to the movement of blocks
between the reference and the current frames. However, where the motion is less obvious
(e.g. small moving objects that do not correspond to complete blocks, irregular motion,
etc.), the motion vector may not indicate genuine motion but rather the position of a good
match.

Figure 6.5  Motion vectors for each macroblock of the frame in Figure 6.2
Figure 6.5 shows the motion vectors produced by motion estimation for each of the
16 x 16 blocks (macroblocks) of the frame in Figure 6.2. Most of the vectors do correspond
to motion: the girl and bicycle are moving to the left and so the vectors point to the right (i.e.
to the region the objects have moved from). There is an anomalous vector in the middle (it is
larger than the rest and points diagonally upwards). This vector does not correspond to 'true'
motion, it simply indicates that the best match can be found in this position.
There are many possible variations on the basic block matching process, some of which
will be described later in this chapter. Alternative measures of DFD energy may be used
(to reduce the computation required to calculate MSE). Varying block sizes, or irregular-shaped
regions, can be more efficient at matching 'true' motion than fixed 16 x 16 blocks.
A better match may be found by searching within two or more reference frames (rather than
just one). The order of searching neighbouring regions can have a significant effect on
matching efficiency and computational complexity. Objects do not necessarily move by an
integral number of pixels between successive frames and so a better match may be obtained
by searching sub-pixel positions in the reference frame. The block matching process itself
only works well for large, regular objects with linear motion: irregular objects and non-linear
motion (such as rotation or deformation) may be modelled more accurately with other
motion estimation methods such as object-based or mesh-based estimation.
6.3 FULL SEARCH MOTION ESTIMATION
Comparison criteria

Mean squared error (MSE) provides a measure of the energy remaining in the difference block.
MSE for an N x N-sample block can be calculated as follows:

MSE = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} (C_{ij} - R_{ij})^2        (6.1)

where C_{ij} is a sample of the current block, R_{ij} is a sample of the reference area and
C_{00}, R_{00} are the top-left samples in the current and reference areas respectively.

Mean absolute error (MAE) provides a reasonably good approximation of residual energy
and is easier to calculate than MSE, since it requires a magnitude calculation instead of a
square calculation for each pair of samples:

MAE = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} |C_{ij} - R_{ij}|          (6.2)

The comparison may be simplified further by neglecting the term 1/N^2 and simply
calculating the sum of absolute errors (SAE) or sum of absolute differences (SAD):

SAE = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} |C_{ij} - R_{ij}|                        (6.3)

SAE gives a reasonable approximation to block energy and so Equation 6.3 is a commonly
used matching criterion for block-based motion estimation.
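The sketch below implements the SAD criterion of Equation 6.3 together with an exhaustive full search over a +/-window region, as an illustration of the block matching process described above; the frame sizes, search window and variable names are illustrative.

import numpy as np

def sad(current_block, candidate):
    """Sum of absolute differences (Equation 6.3) between two equal-sized blocks."""
    return int(np.abs(current_block.astype(int) - candidate.astype(int)).sum())

def full_search(current_block, reference, block_pos, window):
    """Exhaustive search of every integer offset within +/-window samples.
    Returns the motion vector (dx, dy) with the smallest SAD and that SAD."""
    n = current_block.shape[0]
    row, col = block_pos
    best = ((0, 0), float('inf'))
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            r, c = row + dy, col + dx
            if 0 <= r <= reference.shape[0] - n and 0 <= c <= reference.shape[1] - n:
                cost = sad(current_block, reference[r:r + n, c:c + n])
                if cost < best[1]:
                    best = ((dx, dy), cost)
    return best

# Example with a synthetic reference frame and a 16 x 16 block shifted by (3, -2).
rng = np.random.default_rng(1)
reference = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
current = np.zeros_like(reference)
current[20:36, 20:36] = reference[18:34, 23:39]   # block content moved by dx = +3, dy = -2
print(full_search(current[20:36, 20:36], reference, (20, 20), 7))   # expected ((3, -2), 0)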
Figure 6.7  SAD map: (a) current block; (b) search area; (c) map with minima
6.4
FAST SEARCH
The computational complexity of a full search algorithm is often prohibitive, particularly for
software CODECs that must operate in real time. Many alternative 'fast search' algorithms
have been developed for motion estimation. A fast search algorithm aims to reduce the
number of comparison operations compared with full search, i.e. a fast search algorithm will
sample just a few of the points in the SAE 'map' whilst attempting to find the minimum
SAE. The critical question is whether the fast algorithm can locate the true minimum rather
than a local minimum. Whereas the full search algorithm is guaranteed to find the global
minimum SAE, a search algorithm that samples only some of the possible locations in the
search region may get trapped in a local minimum. The result is that the difference block
found by the fast search algorithm contains more energy than the block found by full search
and hence the number of coded bits generated by the video encoder will be larger.
Because of this, fast search algorithms usually give poorer compression performance than
full search.
6.4.1 Three-Step Search

1. Search location (0, 0) in the reference area.
2. Set the step size S = 2^(N-1) (for a search window of +/-(2^N - 1) samples).
3. Search eight locations at +/-S samples around the search origin (together with the origin
itself, nine locations in total).
4. Set the new search origin to the position of the best match.
5. Set S = S/2.
6. Repeat stages 3-5 until S = 1.
Figure 6.9 illustrates the procedure for a search window of +/-7 (i.e. N = 3). The first step
involves searching location (0, 0) and eight locations +/-4 pixels around the origin. The
second step searches +/-2 pixels around the best match from the first step (highlighted in
bold) and the third step searches +/-1 pixels around the best match from the second step
(again highlighted). The best match from this third step is chosen as the result of the search
algorithm. With a search window of +/-7, three repetitions (steps) are required to find the
best match. A total of (9 + 8 + 8) = 25 search comparisons are required for the TSS,
compared with (15 x 15) = 225 comparisons for the equivalent full search. In general,
(8N + 1) comparisons are required for a search area of +/-(2^N - 1) pixels.
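A sketch of the three-step search, following the stages listed above and using SAD as the matching criterion, is shown below; the helper function and parameter names are illustrative.

import numpy as np

def sad(a, b):
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def three_step_search(current_block, reference, block_pos, n_steps=3):
    """Three-step search: test the origin plus eight points at +/-S, move the search
    origin to the best match, halve S and repeat until S = 1 ((8N + 1) distinct
    positions for a +/-(2^N - 1) window)."""
    row, col = block_pos
    size = current_block.shape[0]

    def cost(dx, dy):
        r, c = row + dy, col + dx
        if 0 <= r <= reference.shape[0] - size and 0 <= c <= reference.shape[1] - size:
            return sad(current_block, reference[r:r + size, c:c + size])
        return float('inf')                      # candidate lies outside the reference frame

    best_mv, best_cost = (0, 0), cost(0, 0)      # stage 1: search the origin
    step = 2 ** (n_steps - 1)                    # stage 2: initial step size (4 for +/-7)
    while step >= 1:                             # stages 3-6
        cx, cy = best_mv
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                c = cost(cx + dx, cy + dy)
                if c < best_cost:
                    best_mv, best_cost = (cx + dx, cy + dy), c
        step //= 2
    return best_mv, best_cost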
6.4.2 Logarithmic Search

The logarithmic search algorithm can be summarised as follows:

1. Search location (0, 0).
2. Search four locations in the horizontal and vertical directions, S pixels away from the
origin (where S is the initial step size). The five locations make a '+' shape.
3. Set the new origin to the best match (of the five locations tested). If the best match is at
the centre of the '+', S = S/2, otherwise S is unchanged.
4. If S = 1 then go to stage 5, otherwise go to stage 2.
5. Search eight locations immediately surrounding the best match. The search result is the
best match of the search origin and these eight neighbouring locations.
Figure 6.10 shows an example of the search pattern with S = 2 initially. Again, the best
match at each iteration is highlighted in bold (note that the bold
3 is the best match at
iteration 3 and at iteration 4). In this example 20 search comparisons are required: however,
the number of comparisons varies depending on the number of repetitions of stages 2, 3 and 4
above. Note that the algorithm will not search a candidate position if it is outside the search
window (+/-7 in this example).
6.4.3 Cross-Search³

This algorithm is similar to the three-step search except that five points are compared at each
step (forming an 'X') instead of nine.
1. Search location (0, 0).
2. Search four locations at +/-S, forming an 'X' shape (where S = 2^(N-1), as for the TSS).
3. Set the new origin to be the best match of the five locations tested.
4. If S = 1, go to stage 5; otherwise set S = S/2 and go to stage 2.
5. If the best match is at the top left or bottom right of the 'X', evaluate four more points in
an 'X' at a distance of +/-1; otherwise (best match is at the top right or bottom left)
evaluate four more points in a '+' at a distance of +/-1.

Figure 6.11 shows two examples of the cross-search algorithm: in the first example, the
final points are in the shape of an 'X' and in the second, they are in the shape of a '+'
(the best match at each iteration is highlighted). The number of SAD comparisons is
(4N + 5) for a search area of +/-(2^N - 1) pixels (i.e. 17 comparisons for a +/-7 pixel
window).
6.4.4 One-at-a-Time Search

This simple algorithm essentially involves following the SAD 'gradient' in the horizontal
direction until a minimum is found, then following the gradient in the vertical direction to
find a vertical minimum:
1. Set the horizontal origin to (0, 0).
6.4.5 Nearest Neighbours Search⁴

This algorithm was proposed for H.263 and MPEG-4 (short header) CODECs. In these
CODECs, each motion vector is predicted from neighbouring (already coded) motion vectors
prior to encoding (see Figure 8.3). This makes it preferable to choose a vector close to this
median predictor position, for two reasons. First, neighbouring macroblocks often have
similar motion vectors (so that there is a good chance that the median predictor will be close
to the true best match). Second, a vector near the median will have a small displacement
and therefore a small VLC.
The algorithm proceeds as follows:
1. Search the (0, 0) location.
2. Set the search origin to the predicted vector location and search this position.
3. Search the four neighbouring positions to the origin in a '+' shape.
4. If the search origin (or location (0, 0) for the first iteration) gives the best match, this is the
chosen search result; otherwise, set the new origin to the position of the best match and go
to stage 3.
The median predicted vector is (-3, 3) and this is shown with an arrow. The (0, 0) point
(marked '0') and the first layer of positions (marked '1') are searched: the best match is
highlighted. The layer 2 positions are searched, followed by layer 3. The best match for layer
3 is in the centre of the '+' shape and so the search is terminated.
This algorithm will perform well if the motion vectors are reasonably homogeneous, i.e.
there are not too many sudden changes in the motion vector field. The algorithm described
in⁴ includes two further features. First, if the median predictor is unlikely to be accurate
(because too many neighbouring macroblocks are intra-coded and therefore have no motion
vectors), an alternative algorithm such as the TSS is used. Second, a cost function is
proposed to estimate whether the computational complexity of carrying out the next set of
searches is worthwhile. (This will be discussed further in Chapter 10.)
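The sketch below illustrates the basic nearest-neighbours search: the zero vector and the median-predicted vector are evaluated first, then successive layers of '+'-shaped neighbours around the current best match. The intra-predictor fallback and the cost-function early termination described above are omitted, and the names are illustrative.

import numpy as np

def sad(a, b):
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def nearest_neighbours_search(current_block, reference, block_pos, predicted_mv):
    """Search (0, 0) and the median-predicted vector, then repeatedly examine the
    four '+'-shaped neighbours of the best position so far, stopping as soon as the
    centre of the '+' remains the best match."""
    row, col = block_pos
    size = current_block.shape[0]

    def cost(mv):
        dx, dy = mv
        r, c = row + dy, col + dx
        if 0 <= r <= reference.shape[0] - size and 0 <= c <= reference.shape[1] - size:
            return sad(current_block, reference[r:r + size, c:c + size])
        return float('inf')                               # outside the reference frame

    # Stages 1-2: evaluate the zero vector and the predicted vector.
    best_mv = min([(0, 0), predicted_mv], key=cost)
    best_cost = cost(best_mv)
    while True:                                           # stages 3-4: layer-by-layer search
        cx, cy = best_mv
        layer = [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]
        candidate = min(layer, key=cost)
        if cost(candidate) < best_cost:
            best_mv, best_cost = candidate, cost(candidate)
        else:
            break                                         # the centre is still the best match
    return best_mv, best_cost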
6.4.6 Hierarchical Search

The hierarchical search algorithm (and its variants) searches a coarsely subsampled version
of the image first, followed by successively higher-resolution versions until the full image
resolution is reached:

1. Level 0 consists of the current and reference frames at their full resolutions.
Subsample level 0 by a factor of 2 in the horizontal and vertical directions to produce
level 1.
2. Repeat, subsampling level 1 to produce level 2, and so on until the required number of
levels are available (typically, three or four levels are sufficient).
3. Search the highest level to find the best match: this is the initial 'coarse' motion vector.
4. Search the next lower level around the position of the coarse motion vector and find the
best match.
5. Repeat stage 4 until the best match is found at level 0.
The search method used at the highest level may be full search or a fast algorithm such
as TSS. Typically, at each lower level only +/-1 pixels are searched around the coarse
vector. Figure 6.14 illustrates the method with three levels (2, 1 and 0) and a window
of +/-3 positions at the highest level. A full search is carried out at the top level:
however, the complexity is relatively low because we are only comparing a 4 x 4 pixel area
at each level 2 search location. The best match (the number '2') is used as the centre of the
level 1 search, where eight surrounding locations are searched. The best match (number '1')
is used as the centre of the final level 0 search. The equivalent search window is +/-15
pixels (i.e. the algorithm can find a match anywhere within +/-15 pixels of the origin at
level 0).
In total, 49 searches are carried out at level 2 (each comparing 4 x 4 pixel regions), 8
searches at level 1 (each comparing 8 x 8 pixel regions) and 8 searches at level 0 (comparing
16 x 16 pixel regions).
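A compact sketch of a hierarchical search of this kind is shown below: the frames are repeatedly subsampled by averaging, a small full search is carried out at the coarsest level and the vector is refined by +/-1 sample at each finer level. The subsampling filter, window sizes and function names are illustrative assumptions.

import numpy as np

def sad(a, b):
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def subsample(frame):
    """Halve the resolution by averaging each 2 x 2 group of samples."""
    f = frame[:frame.shape[0] // 2 * 2, :frame.shape[1] // 2 * 2].astype(int)
    return (f[0::2, 0::2] + f[0::2, 1::2] + f[1::2, 0::2] + f[1::2, 1::2]) // 4

def search_around(block, ref, pos, centre, radius):
    """Return the best (dx, dy) within +/-radius of 'centre' and its SAD."""
    n, (row, col) = block.shape[0], pos
    best = None
    for dy in range(centre[1] - radius, centre[1] + radius + 1):
        for dx in range(centre[0] - radius, centre[0] + radius + 1):
            r, c = row + dy, col + dx
            if 0 <= r <= ref.shape[0] - n and 0 <= c <= ref.shape[1] - n:
                cost = sad(block, ref[r:r + n, c:c + n])
                if best is None or cost < best[1]:
                    best = ((dx, dy), cost)
    return best

def hierarchical_search(current, reference, block_pos, block_size=16, levels=3, top_window=3):
    """Search the coarsest level over +/-top_window, then refine by +/-1 at each
    higher-resolution level (an effective window of +/-15 at level 0 in this example)."""
    cur_pyr, ref_pyr = [current], [reference]
    for _ in range(levels - 1):                       # stages 1-2: build the pyramid
        cur_pyr.append(subsample(cur_pyr[-1]))
        ref_pyr.append(subsample(ref_pyr[-1]))

    mv, cost = (0, 0), None
    for level in range(levels - 1, -1, -1):           # stages 3-5: coarse-to-fine search
        row, col = block_pos[0] >> level, block_pos[1] >> level
        size = block_size >> level
        blk = cur_pyr[level][row:row + size, col:col + size]
        if level < levels - 1:
            mv = (mv[0] * 2, mv[1] * 2)               # scale the coarse vector up one level
        radius = top_window if level == levels - 1 else 1
        mv, cost = search_around(blk, ref_pyr[level], (row, col), mv, radius)
    return mv, cost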
6.5 COMPARISON OF MOTION ESTIMATION ALGORITHMS
4. Scalability: does the algorithm perform equally well for large and small search windows?
5. Implementation: is the algorithm suitable for software or hardware implementation on
the chosen platform or architecture?
Criteria 1 and 2 appear to be identical. If the algorithm is effective at minimising the energy
in the motion-compensated residual block, then it ought to provide good compression
efficiency (good image quality at a low compressed bit rate). However, there are other factors
that complicate things: for example, every motion vector that is calculated by the motion
estimation algorithm must be encoded and transmitted as part of the compressed bit stream.
As will be discussed in Chapter 8, larger motion vectors are usually coded with more bits and
so an algorithm that efficiently minimises the residual frame but produces large motion
vectors may be less efficient than an algorithm that is biased towards producing small
motion vectors.
Example

In the following example, block-based motion estimation and compensation were carried out
on five frames of the 'bicycle' sequence (shown in Figure 6.2). Table 6.2 compares the
performance of full search motion estimation with a range of search window sizes. The table
lists the total SAE of the five difference frames without motion compensation (i.e. simply
subtracting the previous from the current frame) and with motion compensation (i.e. block-based
motion compensation on 16 x 16 blocks). The final column lists the total number of
comparison operations (where one operation is the comparison of two luminance samples,
|C_ij - R_ij|). As the search window increases, motion compensation efficiency improves
(shown by a smaller SAE): however, the number of operations increases exponentially with
the window size. This sequence contains relatively low movement and so most of the
Table 6.2  Full search motion estimation, five frames: varying window size

Search window   Total SAE (uncompensated)   Total SAE (compensated)   Number of comparison operations
+/-1            1 326 783                   1 278 610                 1.0 x 10^6
+/-3            ...                         1 173 060                 5.2 x 10^6
+/-7            ...                         898 581                   23.4 x 10^6
+/-15           ...                         897 163                   99.1 x 10^6
Table 6.3  Motion estimation algorithm comparison, five frames: search window = +/-15

Algorithm            Total SAE (uncompensated)   Total SAE (compensated)   Number of comparison operations
Full search          1 326 783                   897 163                   99.1 x 10^6
Three-step search    ...                         914 753                   3.6 x 10^6
performance gain from motion estimation is achieved with a search window of +/-7
samples. Increasing the window to +/-15 gives only a modest improvement in SAE at the
expense of a fourfold increase in computation.
Table 6.3 compares the performance of full search and three-step search with a search
window of +/-15 pixels. Full search produces a lower SAE and hence a smaller residual
frame than TSS. However, the slight increase in SAE produced by the TSS algorithm is
offset by a substantial reduction in the number of comparison operations.
Figure 6.15 shows how a fast search algorithm such as the TSS
may fail to find the
best possible match. The three-step search algorithm starts
by considering the positions
+/-8 pixels around the origin. The best match at the first step is found at (-8, 0) and this is
marked with a circle on the figure. The next step examines positions within +/-4 pixels of
this point and the best of these is found at (-12, -4). Step 3 also chooses the point (-12,
-4) and the final step selects (-13, -3) as the best match (shown on the figure). This point is
a local minimum but not the global minimum. Hence the residual block after motion
compensation will contain more energy than the best match found by the full search
algorithm (the point (6, 1), marked with an 'X').
Of the other search algorithms mentioned above, logarithmic search, cross-search and
one-at-a-time search provide low computational complexity at the expense of relatively poor
matching performance. Hierarchical search can give a good compromise between
performance and complexity and is well suited to hardware implementations. Nearest-neighbours
search, with its in-built bias towards the median-predicted motion vector, is reported to
perform almost as well as full search, with a very much reduced complexity. The high
performance is achieved because the bias tends to produce very small (and hence very efficiently
coded) motion vectors and this efficiency offsets the slight drop in SAE performance.
find the best
3. Subtract the samples of the matching region (whether full- or sub-pixel) from the samples
of the current block to form the difference block.
Half-pixel interpolation is illustrated in Figure 6.16. The original integer pixel positions 'a'
are shown in black. Samples b and c (grey) are formed by linear interpolation between pairs
of integer pixels, and samples d (white) are interpolated between four integer pixels (as
indicated by the arrows). Motion compensation with half-pixel accuracy is supported by the
H.263 standard, and higher levels of interpolation (1/4 pixel or more) are proposed for the
emerging H.26L standard. Increasing the 'depth' of interpolation gives better block
matching performance at the expense of increased computational complexity.
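The sketch below forms a half-pixel interpolated version of a reference area using bilinear averaging of the integer samples, as described for positions b, c and d above. Exact division is used here; the rounding convention in a real CODEC is standard-specific.

import numpy as np

def interpolate_half_pixel(ref):
    """Bilinear interpolation of a reference area to half-pixel resolution.
    Sample (2i, 2j) of the output is the original integer sample; the remaining
    positions are averages of two (b, c) or four (d) neighbouring integer samples."""
    h, w = ref.shape
    out = np.zeros((2 * h - 1, 2 * w - 1))
    out[0::2, 0::2] = ref                                        # positions 'a' (integer samples)
    out[0::2, 1::2] = (ref[:, :-1] + ref[:, 1:]) / 2             # positions 'b' (horizontal pairs)
    out[1::2, 0::2] = (ref[:-1, :] + ref[1:, :]) / 2             # positions 'c' (vertical pairs)
    out[1::2, 1::2] = (ref[:-1, :-1] + ref[:-1, 1:]
                       + ref[1:, :-1] + ref[1:, 1:]) / 4         # positions 'd' (four neighbours)
    return out

ref = np.array([[10, 20],
                [30, 40]], dtype=float)
print(interpolate_half_pixel(ref))
# [[10. 15. 20.]
#  [20. 25. 30.]
#  [30. 35. 40.]]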
Searching on a sub-pixel grid obviously requires more computation than the integer
searches described earlier. In order to limit the increase in complexity, it is common practice
to find the best matching integer position and then to carry out a search at half-pixel
locations immediately around this position. Despite the increased complexity, sub-pixel
motion estimation and compensation can significantly outperform integer motion estimation/
compensation. This is because a moving object will not necessarily move by an integral
number of pixels between successive video frames. Searching sub-pixel locations as well as
integer locations is likely to find a good match in a larger number of cases.
Interpolating the reference area shown in Figure 6.7 to half-pixel accuracy and comparing
the current block with each half-pixel position gives the SAE map shown in Figure 6.17. The
best match (i.e. the lowest SAE) is found at position (6, 0.5). The block found at this position
in the interpolated reference frame gives a better match than position (6, 1) and hence better
motion compensation performance.
6.7.2 Backwards Prediction

The prediction efficiency for cases (2) and (3) above can be improved by using a 'future'
frame (i.e. a later frame in temporal order) as prediction reference. A frame immediately
after a scene cut, or an uncovered object, can be better predicted from a future frame.
Backwards prediction requires the encoder to buffer coded frames and encode them out of
temporal order, so that the future reference frame is encoded before the current frame.
6.7.3 Bidirectional Prediction

In some cases, bidirectional prediction may outperform forward or backward prediction:
here, the prediction reference is formed by merging forward and backward references.
Forward, backward and bidirectional predictions are all available for encoding an MPEG-1
or MPEG-2 B-picture. Typically, the encoder carries out two motion estimation searches for
each macroblock (16 x 16 luminance samples), one based on the previous reference picture
(an I- or P-picture) and one based on the future reference picture. The encoder finds the
motion vector that gives the best match (i.e. the minimum SAE) based on (a) the previous
reference frame and (b) the future reference frame. A third SAE value (c) is calculated by
subtracting the average of the two matching areas (previous and future) from the current
macroblock. The encoder chooses the mode of the current macroblock based on the
smallest of these three SAE values:
(a) forward prediction
(b) backwards prediction, or
(c) bidirectional prediction.
In this way, the encoder canfind the optimum prediction reference for each macroblock and
this improves compression efficiency by up to 50% for B-pictures.
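The mode decision can be sketched as follows: given the best forward and backward matching areas already found by motion estimation, the encoder computes the three SAE values and picks the smallest. The function and variable names are illustrative.

import numpy as np

def sae(a, b):
    """Sum of absolute errors between two equal-sized sample arrays."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def choose_b_prediction_mode(current_mb, forward_match, backward_match):
    """Compute the three SAE values (forward, backward, bidirectional average)
    and select the prediction mode with the smallest SAE for this macroblock."""
    bidirectional = (forward_match.astype(int) + backward_match.astype(int)) // 2
    costs = {'forward': sae(current_mb, forward_match),
             'backward': sae(current_mb, backward_match),
             'bidirectional': sae(current_mb, bidirectional)}
    return min(costs, key=costs.get), costs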
6.7.4 Multiple Reference Frames

MPEG-1 or MPEG-2 B-pictures are encoded using two reference frames. This approach may
be extended further by allowing the encoder to choose a reference frame from a large
number of previously encoded frames. Choosing between multiple possible reference frames
can be a useful tool in improving error resilience (as discussed in Chapter 11). This method
is supported by the H.263 standard (Annexes N and U, see Chapter 5) and has been analysed in⁵.
Encoder and decoder complexity and storage requirements increase as more prediction
reference frames are utilised. Simple forward prediction from the previous encoded frame
gives the lowest complexity (but also the poorest compression efficiency), whilst the other
methods discussed above add complexity (and potentially encoding delay) but give
improved compression efficiency.
Figure 6.18 illustrates the prediction options discussed above, showing forward and
backwards prediction from past and future frames.

Figure 6.18  Forward and backward prediction
6.8 ENHANCEMENTS TO THE MOTION MODEL
6.8.1 Vectors That Can Point Outside the Reference Picture
If movement occurs near the edges of the picture, the best match for an edge block may
actually be offset slightly outside the boundaries of the reference picture. Figure 6.19 shows
an example: the ball that has appeared in the current frame is partly visible in the reference
frame and part of the best matching block will be found slightly above the boundary of the
frame. The match may be improved by extrapolating the pixel values at the edge of the
reference picture. Annex D of H.263 supports this type of prediction by simple linear
extrapolation of the edge pixels into the area around the frame boundaries (shown in Figure
6.19). Block matching efficiency and hence compression efficiency is slightly improved for
video sequences containing motion near the edges of the picture.
Figure 6.19 Example of best match found outside the reference picture
motion vector per macroblock) and 8 x 8 (four vectors per macroblock): the small block size is used when it gives better coding performance than the large block size. Motion compensation performance is noticeably improved at the expense of an increase in complexity: carrying out 4 searches per macroblock (albeit on a smaller block size with only 64 calculations per SAE comparison) requires more operations.

The emerging H.26L standard takes this approach further and supports multiple possible block sizes for motion compensation within a macroblock. Motion compensation may be carried out for sub-blocks with horizontal or vertical dimensions of any combination of 4, 8 or 16 samples. The extreme cases are 4 x 4 sub-blocks (resulting in 16 vectors per macroblock) and 16 x 16 blocks (one vector per macroblock), with many possibilities in between (4 x 8, 8 x 8, 4 x 16 blocks, etc.). This flexibility gives a further increase in compression performance at the expense of higher complexity.
2. a sample predicted using the motion vector of the adjacent block in the vertical direction (i.e. the nearest neighbour block above or below) (R1);

3. a sample predicted using the motion vector of the adjacent block in the horizontal direction (i.e. the nearest neighbour block left or right) (R2).

The final sample is a weighted average of the three values. R0 is given the most weight (because it uses the current block's motion vector); R1 and R2 are given more weight when the current sample is near the edge of the block, and less weight when it is in the centre of the block.
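A minimal C sketch of this weighted average is given below. The weights w0, w1 and w2 are illustrative placeholders only; the actual position-dependent weight matrices are specified in H.263 Annex F and are not reproduced here.

#include <stdint.h>

/* Form one OBMC prediction sample as a weighted average of three candidate
   predictions: r0 (current block's vector), r1 (vertical neighbour's vector)
   and r2 (horizontal neighbour's vector). The weights are placeholders; the
   normative weight matrices are defined in H.263 Annex F. */
static uint8_t obmc_sample(uint8_t r0, uint8_t r1, uint8_t r2,
                           int w0, int w1, int w2)
{
    int total = w0 + w1 + w2;
    int pred = (w0 * r0 + w1 * r1 + w2 * r2 + total / 2) / total; /* rounded */
    if (pred < 0)   pred = 0;
    if (pred > 255) pred = 255;
    return (uint8_t)pred;
}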
The result of OBMC is to smooth the prediction across block boundaries in the reference frame. OBMC is supported by Annex F of H.263 and gives a slight increase in motion compensation performance (at the expense of a significant increase in complexity). A similar smoothing effect can be obtained by applying a filter to the block edges in the reference frame, and later versions of H.263 (H.263+ and H.263++) recommend using a block filter instead of OBMC because it gives similar performance with lower computational complexity. OBMC and filtering performance have been discussed elsewhere,6 and filters are examined in more detail in Chapter 9.
6.8.4 Complex Motion Models
The motion estimation and compensation schemes discussed so far have assumed a simple translational motion model, i.e. they work best when all movement in a scene occurs in a plane perpendicular to the viewer. Of course, there are many other types of movement such as rotation, movements towards or away from the viewer (zooming) and deformation of objects (such as a human body). Better motion compensation performance may be achieved by matching the current frame to a more complex motion model.

In the MPEG-4 standard, a video object plane may be predicted from the pixels that exist only within a reference VOP. This is a form of region-based motion compensation, where
compensation is carried out on arbitrarily shaped regions rather than fixed rectangular blocks. This has the capability to provide a more accurate motion model for natural video scenes (where moving objects rarely have neat rectangular boundaries).

Figure 6.20 Reference picture before and after warping
Picture warping involves applying a global warping transformation to the entire reference picture, for example to compensate for global movements such as camera zoom or camera rotation.
Mesh-based motion compensation overlays the reference picture with a 2-D mesh of triangles. The motion-compensated reference is formed by moving the corners of each triangle and deforming the reference picture pixels accordingly (Figure 6.20 shows the general approach). A deformable mesh can model a wide range of movements, including object rotation, zooming and limited object deformations. A smaller mesh will give a more accurate motion model (but higher complexity).
Still more accurate modelling may be achieved using object-based coding, where the encoder attempts to maintain a 3-D model of the video scene. Changes between frames are modelled by moving and deforming the components of the 3-D scene.

Picture warping is significantly more complex than standard block matching. Mesh-based and object-based coding are successively more complex and are not suitable for real-time applications with current processing technology. However, they offer significant potential for future video coding systems when more processing power becomes available. These and other motion models are active areas for research.
6.9 IMPLEMENTATION
6.9.1 Software Implementations
Unless dedicated hardware assistance is available (e.g. a motion estimation co-processor),
the key issue in a software implementation of motion estimation is the trade-off between
computational complexity (the total number of processor cycles required) and compression
performance. Other important considerations include:
• It may be preferable to carry out motion estimation for the entire frame before further encoding takes place: however, this requires more storage and can introduce more delay than an implementation where each macroblock is estimated, compensated, encoded and transmitted before moving on to the next macroblock.
Even with the use of fast search algorithms, motion estimation is often the most computationally intensive operation in a software video CODEC and so it is important to find ways to speed up the process. Possible approaches to optimising the code include:
1. Loop unrolling. Figure 6.21 lists pseudocode for two possible versions of the SAE calculation (Equation 6.3) for a 16 x 16 block. Version (a) is a direct, compact implementation of the equation. However, each of the 16 x 16 = 256 calculations is accompanied by incrementing and checking the inner loop counter i. Version (b) unrolls the inner loop and repeats the calculation 16 times.
(a) Direct implementation:

totalSAE = 0;
for j = 0 to 15 {       // Row counter
  for i = 0 to 15 {     // Column counter
    totalSAE = totalSAE + abs(C[i,j] - R[i+ioffset, j+joffset]);
  }
}

(b) Unrolled inner loop:

totalSAE = 0;
for j = 0 to 15 {       // Row counter
  totalSAE = totalSAE + abs(C[0,j]  - R[0+ioffset,  j+joffset]);
  totalSAE = totalSAE + abs(C[1,j]  - R[1+ioffset,  j+joffset]);
  totalSAE = totalSAE + abs(C[2,j]  - R[2+ioffset,  j+joffset]);
  totalSAE = totalSAE + abs(C[3,j]  - R[3+ioffset,  j+joffset]);
  totalSAE = totalSAE + abs(C[4,j]  - R[4+ioffset,  j+joffset]);
  totalSAE = totalSAE + abs(C[5,j]  - R[5+ioffset,  j+joffset]);
  totalSAE = totalSAE + abs(C[6,j]  - R[6+ioffset,  j+joffset]);
  totalSAE = totalSAE + abs(C[7,j]  - R[7+ioffset,  j+joffset]);
  totalSAE = totalSAE + abs(C[8,j]  - R[8+ioffset,  j+joffset]);
  totalSAE = totalSAE + abs(C[9,j]  - R[9+ioffset,  j+joffset]);
  totalSAE = totalSAE + abs(C[10,j] - R[10+ioffset, j+joffset]);
  totalSAE = totalSAE + abs(C[11,j] - R[11+ioffset, j+joffset]);
  totalSAE = totalSAE + abs(C[12,j] - R[12+ioffset, j+joffset]);
  totalSAE = totalSAE + abs(C[13,j] - R[13+ioffset, j+joffset]);
  totalSAE = totalSAE + abs(C[14,j] - R[14+ioffset, j+joffset]);
  totalSAE = totalSAE + abs(C[15,j] - R[15+ioffset, j+joffset]);
}

Figure 6.21 Pseudocode for the SAE calculation: (a) direct implementation; (b) unrolled inner loop
More lines of code are required but, on most platforms, version (b) will run faster (note that some compilers automatically unroll repetitive loops, but better performance can often be achieved by explicitly unrolling loops).
2. Hand-coding of critical operations. The SAE calculation for a block (Equation 6.3) is
carried out many times during motion estimation and is therefore a candidate for coding
in assembly language.
3. Reuse of calculated values. Consider the final stage of the TSS algorithm shown in Figure 6.9: a total of nine SAE matches are compared, each 1 pixel apart. This means that most of the operations of each SAE match are identical for each search location. It may therefore be possible to reduce the number of operations by reusing some of the calculated values |Cij - Rij| between successive SAE calculations. (However, this may not be possible if multiple-sample calculations are used, see below.)
4. Calculate multiple sample comparisons in a single operation. Matching is typically carried out on 8-bit luminance samples from the current and reference frames. A single match operation |Cij - Rij| takes as its input two 8-bit values and produces an 8-bit output value. With a large word width (e.g. 32 or 64 bits) it may be possible to carry out several matching operations at once by packing several input samples into a word. Figure 6.22 shows the general idea: here, four luminance samples are packed into each of two input words and the results of |Cij - Rij| for each sample are available as the 4 bytes of an output word. Care is required with this approach: first, there is an overhead associated with packing and unpacking bytes into/out of words, and second, there may be the possibility of overflow during the comparison (since the result of Cij - Rij is actually a 9-bit signed number prior to the magnitude operator | |).
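The following C sketch illustrates the packing idea for four samples per 32-bit word. It unpacks each byte before subtracting (so the 9-bit signed intermediate cannot overflow) and therefore mainly shows the packing/unpacking overhead rather than a genuinely parallel comparison; in practice, dedicated SIMD instructions (such as MMX-style packed operations) would be used instead.

#include <stdint.h>
#include <stdlib.h>

/* Compute the sum of four |current - reference| results from four samples
   packed into each of two 32-bit words. Each byte is unpacked to an int
   before subtraction because Cij - Rij is a 9-bit signed value and cannot
   safely be held in an 8-bit packed lane. */
static uint32_t sad4_packed(uint32_t cur4, uint32_t ref4)
{
    uint32_t sad = 0;
    for (int k = 0; k < 4; k++) {
        int c = (cur4 >> (8 * k)) & 0xFF;   /* unpack one current sample   */
        int r = (ref4 >> (8 * k)) & 0xFF;   /* unpack one reference sample */
        sad += (uint32_t)abs(c - r);
    }
    return sad;
}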
These and further optimisations may be applied to significantly increase the speed of the
search calculation. In general, more optimisation leads to more lines of code that may be
difficult to maintain and may only perform well on a particular processor platform.
However,
increased motion estimation performance can outweigh these disadvantages.
Figure 6.22 Calculating four sample comparisons in a single operation: four samples from the current block and four from the reference area are packed into two words, and the four |current - reference| results form the bytes of an output word
This check itself takes processing time and so it is not efficient to test after every single
sample comparison: instead, a good approach is to include the above check after each inner
loop (i.e. each row of 16 comparisons).
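In a typical implementation the check in question compares the running SAE total with the smallest SAE found so far, abandoning the current search position as soon as it can no longer win. A C sketch, applying the test once per row as suggested:

#include <stdint.h>
#include <stdlib.h>

#define BLOCK 16

/* SAE of one 16x16 block with early termination: after each row of 16
   comparisons the running total is compared with the best (smallest) SAE
   found so far, and the calculation is abandoned if it cannot improve. */
uint32_t sae_16x16_early_exit(const uint8_t cur[BLOCK][BLOCK],
                              const uint8_t *ref, int ref_stride,
                              int ioffset, int joffset,
                              uint32_t best_so_far)
{
    uint32_t total = 0;
    for (int i = 0; i < BLOCK; i++) {                  /* row counter    */
        const uint8_t *refrow = ref + (i + ioffset) * ref_stride + joffset;
        for (int j = 0; j < BLOCK; j++)                /* column counter */
            total += (uint32_t)abs((int)cur[i][j] - (int)refrow[j]);
        if (total >= best_so_far)
            return total;                 /* cannot beat the best match   */
    }
    return total;
}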
Row and column projections

A projection of each row and column in the current and reference blocks is formed. The projection is formed by adding all the luminance values in the current row or column: for a 16 x 16 block, there are 16 row projections and 16 column projections. Figure 6.23 shows the projections for one macroblock.
An approximation to SAE is calculated as follows:

SAE_approx = Σ (i = 0 to N-1) |Ccol_i - Rcol_i|  +  Σ (j = 0 to N-1) |Crow_j - Rrow_j|

Figure 6.23 Row and column projections of a macroblock
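A C sketch of this approximation for a 16 x 16 macroblock follows (in a practical search the projections of the current block would, of course, be calculated once and reused for every candidate position):

#include <stdint.h>
#include <stdlib.h>

#define N 16

/* Approximate SAE from row and column projections. Each projection is the
   sum of the luminance values in one row or one column of the block. */
uint32_t sae_approx_projections(const uint8_t cur[N][N], const uint8_t ref[N][N])
{
    uint32_t crow[N] = {0}, ccol[N] = {0}, rrow[N] = {0}, rcol[N] = {0};

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            crow[i] += cur[i][j];  ccol[j] += cur[i][j];
            rrow[i] += ref[i][j];  rcol[j] += ref[i][j];
        }

    uint32_t approx = 0;
    for (int k = 0; k < N; k++) {
        approx += (uint32_t)abs((int)ccol[k] - (int)rcol[k]);
        approx += (uint32_t)abs((int)crow[k] - (int)rrow[k]);
    }
    return approx;
}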
Figure 6.25 Plot of SAE values
6.9.2 Hardware Implementations
The design of a motion estimation unit in hardware is subject to a number of (potentially
conflicting) aims:
1. Maximise compression performance.
block matching performance.

1. Calculate the SAE for several search positions in parallel: M processing units are used, each of which calculates a single SAE result. The smallest SAE of the M results is
Figure 6.26 Parallel calculation of M SAE results: M processors feed a comparator which selects the best match
chosen as the best match (for that particular set of calculations). The number of cycles is reduced (and the gate count of the design is increased) by a factor of approximately M.
2. Calculate partial SAE results for each pixel position in parallel. For example, the SAE calculation for a 16 x 16 block may be speeded up by using 16 processors, each of which calculates the SAE component for one column of pixels in the current block. Again, this approach has the potential to speed up the calculation by approximately M times (if M parallel processors are used).
Fast search
It may not be feasible or practical to carry
out a complete full search because
of gate count or
clock speed limitations. Fast search algorithms can perform almost as well as full search
with many fewer comparison operations and so these are attractive for hardware as well as software implementations.

Table 6.4 Pipelining successive steps of the three-step search

              Step 1     Step 2     Step 3
Time slot 1   Block 1    -          -
Time slot 2   Block 2    Block 1    -
Time slot 3   Block 3    Block 2    Block 1
Time slot 4   Block 4    Block 3    Block 2
In a dedicated hardware design it may be necessary to carry out each motion estimation search in a fixed number of cycles (in order to ensure that all the processing units within the design are fully utilised during encoding). In this case algorithms such as logarithmic search and nearest neighbours search are not ideal because the total number of comparisons varies from block to block. Algorithms such as the three-step search and hierarchical search are more useful because the number of operations is constant for every block.
Parallel computation may be employed to speed up the algorithm further, for example:
1. Each SAE calculation may be speeded up by using parallel processing units (each calculating the SAE for one or more columns of pixels).

2. The comparisons at one step or level of the algorithm may be computed in parallel (for example, one step of the three-step search or one level of the hierarchical search).

3. Successive steps of the algorithm may be pipelined to increase throughput. Table 6.4 shows an example for the three-step search. The first nine comparisons (step 1) are calculated for block 1. The next eight comparisons (step 2) for block 1 are calculated by another processing unit (or set of units), whilst step 1 is calculated for block 2, and so on.
Note that the steps or levels cannot be calculated in parallel: the search locations examined in step 2 depend on the result of step 1 and so cannot be calculated until the outcome of step 1 is known.
Option 3 above (pipelining of successive steps) is useful for sub-pixel motion estimation. Sub-pixel estimation is usually carried out on the sub-pixel positions around the best integer-pixel match, and this estimation step may also be pipelined. Figure 6.27 shows an example for a three-step search (+/-7 pixels) followed by a half-pixel estimation step.
Note that memory bandwidth may be an important issue with this type of design. Each step requires access to the current block and reference area and this can lead to an unacceptably high level of memory accesses. One option is to copy the current and reference areas to separate local memories for each processing stage, but this requires more local memory.
Descriptions of hardware implementations of motion estimation algorithms can be found elsewhere.
Figure 6.27 Pipelined three-step search followed by a half-pixel estimation step
6.10 SUMMARY
Motion estimation is used in an inter-frame video encoder to create a model that matches the current frame as closely as possible, based on one or more previously transmitted frames (reference frames). This model is subtracted from the current frame (motion compensation) to produce a motion-compensated residual frame. The decoder recreates the model (based on information sent by the encoder) and adds the residual frame to reconstruct a copy of the original frame.

The goal of motion estimation design is to minimise the amount of coded information (residual frame and model information), whilst keeping the computational complexity of motion estimation and compensation to an acceptable limit. Many reduced-complexity motion estimation methods exist (fast search algorithms), and these allow the designer to trade increased computational efficiency against reduced compression performance.

After motion estimation and compensation, the next problem faced by a video CODEC is to efficiently compress the residual frame. The most popular method is transform coding and this is discussed in the next chapter.
REFERENCES
1. T. Koga, K. Iinuma et al., Motion compensated interframe coding for video conference, Proc. NTC, November 1981.
2. J. R. Jain and A. K. Jain, Displacement measurement and its application in interframe image coding, IEEE Trans. Communications, 29, December 1981.
3. M. Ghanbari, The cross-search algorithm for motion estimation, IEEE Trans. Communications, 38, July 1990.
4. M. Gallant, G. Côté and F. Kossentini, An efficient computation-constrained block-based motion estimation algorithm for low bit rate video coding, IEEE Trans. Image Processing, 8(12), December 1999.
5. T. Wiegand, X. Zhang and B. Girod, Long-term memory motion compensated prediction, IEEE Trans. CSVT, September 1998.
6. B. Tao and M. Orchard, Removal of motion uncertainty and quantization noise in motion compensation, IEEE Trans. CSVT, 11(1), January 2001.
7. Y. Wang, Y. Wang and H. Kuroda, A globally adaptive pixel-decimation algorithm for block motion estimation, IEEE Trans. CSVT, 10(6), September 2000.
8. X. Li and C. Gonzales, A locally quadratic model of the motion estimation error criterion function and its application to subpixel interpolation, IEEE Trans. CSVT, 3, February 1993.
9. Y. Senda, Approximate criteria for the MPEG-2 motion estimation, IEEE Trans. CSVT, 10(3), April 2000.
10. P. Pirsch, N. Demassieux and W. Gehrke, VLSI architectures for video compression - a survey, Proceedings of the IEEE, 83(2), February 1995.
11. C. J. Kuo, C. H. Yeh and S. F. Odeh, Polynomial search algorithm for motion estimation, IEEE Trans. CSVT, 10(5), August 2000.
12. G. Fujita, T. Onoye and I. Shirakawa, A VLSI architecture for motion estimation core dedicated to H.263 video coding, IEICE Trans. Electronics, E81-C(5), May 1998.
Transform Coding
7.1 INTRODUCTION
Transform coding is at the heart of the majority of video coding systems and standards.
Spatial image data (image samples or motion-compensated residual samples) are transformed into a different representation, the transform domain. There are good reasons for transforming image data in this way. Spatial image data is inherently difficult to compress: neighbouring samples are highly correlated (interrelated) and the energy tends to be evenly distributed across an image, making it difficult to discard data or reduce the precision of data without adversely affecting image quality. With a suitable choice of transform, the data is easier to compress in the transform domain. There are several desirable properties of a transform for compression. It should compact the energy in the image (concentrate the energy into a small number of significant values); it should decorrelate the data (so that discarding insignificant data has a minimal effect on image quality); and it should be suitable for practical implementation in software and hardware.
The two most widely used image compression transforms are the discrete cosine transform (DCT) and the discrete wavelet transform (DWT). The DCT is usually applied to small, regular blocks of image samples (e.g. 8 x 8 squares) and the DWT is usually applied to larger image sections (tiles) or to complete images. Many alternatives have been proposed, for example 3-D transforms (dealing with spatial and temporal correlation), variable block-size transforms, fractal transforms and Gabor analysis. The DCT has proved particularly durable and is at the core of most of the current generation of image and video coding standards, including JPEG, H.261, H.263, H.263+, MPEG-1, MPEG-2 and MPEG-4. The DWT is gaining popularity because it can outperform the DCT for still image coding and so it is used in the new JPEG image coding standard (JPEG-2000) and for still texture coding in MPEG-4.
This chapter concentrates on the DCT. The theory and properties of the transforms are
described first, followed by an introduction to practical algorithms and architectures for the
DCT. Closely linked with the DCT is the process of quantisation and the chapter ends with a
discussion of quantisation theory and practice.
Figure 7.1 The forward DCT: a set of samples is transformed into a set of DCT coefficients (1-D and 2-D forms)
The forward DCT (FDCT) transforms a set of image samples (the spatial domain) into a set of transform coefficients (the transform domain). The transform is reversible: the inverse DCT (IDCT) transforms a set of coefficients into a set of image samples. The forward and inverse transforms are commonly used in 1-D or 2-D forms for image and video compression. The 1-D version transforms a 1-D array of samples into a 1-D array of coefficients, whereas the 2-D version transforms a 2-D array (block) of samples into a block of coefficients. Figure 7.1 shows the two forms of the DCT.
The DCT has two useful properties for image and video compression: energy compaction (concentrating the image energy into a small number of coefficients) and decorrelation (minimising the interdependencies between coefficients). Figure 7.2 illustrates the energy compaction property of the DCT. Image (a) is an 80 x 80 pixel image and image (b) plots the coefficients of the 2-D DCT. The energy in the transformed coefficients is concentrated about the top-left corner of the array of coefficients (compaction). The top-left coefficients correspond to low frequencies: there is a peak in energy in this area and the coefficient values rapidly decrease towards the bottom right of the array (the higher-frequency coefficients). The DCT coefficients are decorrelated, which means that many of the coefficients with small values can be discarded without significantly affecting image quality. A compact array of decorrelated coefficients can be compressed much more efficiently than an array of highly correlated image pixels.

The decorrelation and compaction performance of the DCT increases with block size. However, computational complexity also increases (exponentially) with block size. A block size of 8 x 8 is commonly used in image and video coding applications. This size gives a good compromise between compression efficiency and computational efficiency (particularly as there are a number of efficient algorithms for a DCT of size 2^m x 2^m, where m is an integer). The forward DCT for an 8 x 8 block of image samples is given by Equation 7.1:
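In the commonly used orthonormal form (scaling conventions vary slightly between presentations), the 8 x 8 forward DCT can be written as:

F(x,y) = \frac{1}{4} C(x)\, C(y) \sum_{i=0}^{7} \sum_{j=0}^{7} f(i,j)\,
         \cos\!\left[\frac{(2i+1)x\pi}{16}\right] \cos\!\left[\frac{(2j+1)y\pi}{16}\right]

\text{where } C(k) = 1/\sqrt{2} \text{ for } k = 0 \text{ and } C(k) = 1 \text{ otherwise.}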
Figure 7.2 (a) 80 x 80 pixel image; (b) plot of the 2-D DCT coefficients

Figure 7.4 DCT basis patterns

Figure 7.5 DCT coefficients (continued)
A reasonable approximation to the original image block can be reconstructed from just these six coefficients, as shown in Figure 7.6. First, coefficient (0, 0) is multiplied by a weight of 967.5 and transformed with the inverse DCT. This coefficient represents the average shade of the block (in this case, mid-grey) and is often described as the DC coefficient (the DC coefficient is usually the most significant in any block). Figure 7.6 shows the reconstructed block formed by the DC coefficient only (the top-right block in the figure). Next, coefficient (1, 0) is multiplied by a weight of -163.4 (equivalent to subtracting its basis pattern). The weighted basis pattern is shown in the second row of Figure 7.6 (on the left) and the sum of the first two patterns is shown on the right. As each of the further four basis patterns is added to the reconstruction, more detail is added to the reconstructed block. The final result (shown on the bottom right of Figure 7.6 and produced using just 6 out of the 64 coefficients) is a good approximation of the original. This example illustrates the two key properties of the DCT: the significant coefficients are clustered around the DC coefficient (compaction) and the block may be reconstructed using only a small number of coefficients (decorrelation).
Figure 7.6 Reconstruction of an image block from six basis patterns: (0,0) x 967.5, (1,0) x -163.4, (1,1) x -71.3, (3,0) x 81.8, (4,0) x 38.9, ...
Figure 7.7 (a) Original image; (b) single-stage wavelet decomposition
dimensions. The top-right corner (HL) consists of residual vertical frequencies (i.e. the vertical component of the difference between the subsampled LL image and the original image). The bottom-left corner LH contains residual horizontal frequencies (for example, the accordion keys are very visible here), whilst the bottom-right corner HH contains residual diagonal frequencies.

This decomposition process may be repeated for the LL component to produce another set of four components: a new LL component that is a further subsampled version of the original image, plus three more residual frequency components. Repeating the decomposition three times gives the wavelet representation shown in Figure 7.8. The small image in the top left is the low-pass filtered original and the remaining squares contain progressively higher-frequency residual components. This process may be repeated further if desired (until, in the limit, the top-left component contains only 1 pixel which is equivalent to the DC or average value of the entire image). Each sample in Figure 7.8 represents a wavelet transform coefficient.
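As an illustration of a single decomposition stage (using the Haar filter purely because it is the simplest possible choice; practical wavelet codecs use longer filters and sub-band naming conventions vary), a C sketch is:

/* Minimal single-stage 2-D Haar-style decomposition (illustrative only).
   The W x H input is split into four (W/2) x (H/2) sub-bands: a low-pass,
   subsampled image plus horizontal, vertical and diagonal detail bands. */
void haar_decompose(const int *in, int w, int h,
                    int *ll, int *hdet, int *vdet, int *ddet)
{
    int half_w = w / 2, half_h = h / 2;
    for (int y = 0; y < half_h; y++) {
        for (int x = 0; x < half_w; x++) {
            int a = in[(2*y)     * w + 2*x];       /* top-left     */
            int b = in[(2*y)     * w + 2*x + 1];   /* top-right    */
            int c = in[(2*y + 1) * w + 2*x];       /* bottom-left  */
            int d = in[(2*y + 1) * w + 2*x + 1];   /* bottom-right */

            ll[y * half_w + x]   = (a + b + c + d) / 4;  /* low-pass (LL)    */
            hdet[y * half_w + x] = (a - b + c - d) / 4;  /* horizontal detail */
            vdet[y * half_w + x] = (a + b - c - d) / 4;  /* vertical detail   */
            ddet[y * half_w + x] = (a - b - c + d) / 4;  /* diagonal detail   */
        }
    }
}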
The wavelet decomposition has some important properties. First, the number of wavelet coefficients (the spatial values that make up Figure 7.8) is the same as the number of pixels in the original image, and so the transform is not inherently adding or removing information. Second, many of the coefficients of the high-frequency components (HH, HL and LH at each stage) are zero or insignificant. This reflects the fact that much of the important information in an image is low-frequency. Our response to an image is based upon a low-frequency overview of the image, with important detail added by higher frequencies in a few significant areas of the image. This implies that it should be possible to efficiently
compress the wavelet representation shown in Figure 7.8 if we can discard the insignificant
higher-frequency coefficients whilst preserving the significant ones. Third, the decomposition is not restricted by block boundaries (unlike the DCT) and hence may be a more flexible
way of decorrelating the image data (i.e. concentrating the
significant components into a few
coefficients) than the block-based DCT.
The method of representing significant coefficients whilst discarding insignificant coefficients is critical to the use of wavelets in image compression. The embedded zerotree approach and, more recently, set partitioning into hierarchical trees (SPIHT) are considered by some researchers to be the most effective way of doing this.2 The wavelet decomposition can be thought of as a tree, where the root is the top-left LL component and its branches are the successively higher-frequency LH, HL and HH components at each layer. Each coefficient in a low-frequency component has a number of corresponding child coefficients in higher-frequency components. This concept is illustrated in Figure 7.9, where a single coefficient at layer 1 maps to four child coefficients in each component at layer 2. Zerotree coding works on the principle that if a parent coefficient is visually insignificant then its children are unlikely to be significant. Working from the top left, each coefficient and its children are encoded as a tree. As soon as the tree reaches a coefficient that is insignificant, that coefficient and all its children are coded as a zero tree. The decoder will reconstruct the significant coefficients and set all coefficients in a zero tree to zero.
This approach provides a flexible and powerful method of image compression. The decision as to whether a coefficient is significant or insignificant is made by comparing it with a threshold. Setting a high threshold means that most of the coefficients are discarded and the image is highly compressed; setting a low threshold means that most coefficients are retained, giving low compression and high image fidelity. This process is equivalent to quantisation of the wavelet coefficients.
Wavelet-based compression performs well for still images (particularly in comparison with DCT-based compression) and can be implemented reasonably efficiently. Under high compression, wavelet-compressed images do not exhibit the blocking effects characteristic of the DCT. Instead, degradation is more graceful and leads to a gradual blurring of the image as higher-frequency coefficients are discarded. Figure 7.10 compares the results of compression of the original image (on the left) with a DCT-based algorithm (middle image, JPEG compression) and a wavelet-based algorithm (right-hand image, JPEG-2000 compression).
In each case, the compression ratio is 16x. The decompressed JPEG image is clearly distorted and blocky, whereas the decompressed JPEG-2000 image is much closer to the original.

Figure 7.10 (a) Original; (b) compressed and decompressed (DCT); (c) compressed and decompressed (wavelet)
Because of its good performance in compressing images, the DWT is used in the new JPEG-2000 still image compression standard and is incorporated as a still image compression tool in MPEG-4 (see Chapter 4). However, wavelet techniques have not yet gained widespread support for motion video compression because there is not an easy way to extend wavelet compression in the temporal domain. Block-based transforms such as the DCT work well with block-based motion estimation and compensation, whereas efficient, computationally tractable motion-compensation methods suitable for wavelet-based compression have not yet been demonstrated. Hence, the DCT is still the most popular transform for video coding applications.
7.4.1 Separable Transforms
Many practical implementations of the FDCT and IDCT use the separable property of the transforms to simplify calculation. The 2-D FDCT can be calculated by repeatedly applying the 1-D DCT. The 1-D FDCT is given by Equation 7.3:
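In its commonly used orthonormal form (the exact scaling convention may differ between presentations), the 8-point 1-D FDCT is:

F(x) = \frac{1}{2} C(x) \sum_{i=0}^{7} f(i) \cos\!\left[\frac{(2i+1)x\pi}{16}\right],
\qquad C(0) = 1/\sqrt{2}, \; C(x) = 1 \text{ for } x > 0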
Figure 7.11 The separable 2-D DCT: 8 x 8 samples (i, j) are transformed by a 1-D FDCT on the rows and a 1-D FDCT on the columns to give 8 x 8 coefficients (x, y)
where Fx is the 1-D DCT described by Equation 7.3. In other words, the 2-D DCT can be formed by applying the 1-D DCT first in one dimension and then in the other (Equation 7.4). The 2-D DCT of an 8 x 8 block can be calculated in two passes: a 1-D DCT of each row, followed by a 1-D DCT of each column (or vice versa). This property is known as separability and is shown graphically in Figure 7.11.

The 2-D IDCT can be calculated using two 1-D IDCTs in a similar way. The equation for the 1-D IDCT is given by Equation 7.5:
2. Maximal regularity: the 1-D or 2-D calculations are organised to regularise the data flow and processing order. This approach is suitable for dedicated hardware implementations.

In general, 1-D implementations (using the separable property described above) are less complex than 2-D implementations, but it is possible to achieve higher performance by manipulating the 2-D equations directly.
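A C sketch of the separable row-column approach follows (dct_1d is a hypothetical 8-point 1-D DCT routine, for example one of the flowgraph algorithms of Section 7.4.2):

#define BS 8

/* Hypothetical 1-D 8-point DCT: transforms in[0..7] into out[0..7]. */
extern void dct_1d(const double in[BS], double out[BS]);

/* 2-D DCT of an 8x8 block using the separable property:
   a 1-D DCT of each row, followed by a 1-D DCT of each column. */
void dct_2d(const double block[BS][BS], double coeff[BS][BS])
{
    double tmp[BS][BS];

    for (int r = 0; r < BS; r++)                  /* pass 1: rows */
        dct_1d(block[r], tmp[r]);

    for (int c = 0; c < BS; c++) {                /* pass 2: columns */
        double colin[BS], colout[BS];
        for (int r = 0; r < BS; r++) colin[r] = tmp[r][c];
        dct_1d(colin, colout);
        for (int r = 0; r < BS; r++) coeff[r][c] = colout[r];
    }
}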
7.4.2 Flowgraph Algorithms
The computational requirements of the 1-D DCT may be reduced significantly by exploiting the symmetry of the cosine weighting factors. We show how the complexity of the DCT can be reduced in the following example, using the calculation of coefficient F2:

F2 = 1/2 [ f0 cos(π/8) + f1 cos(3π/8) + f2 cos(5π/8) + f3 cos(7π/8)
         + f4 cos(9π/8) + f5 cos(11π/8) + f6 cos(13π/8) + f7 cos(15π/8) ]      (7.7)

The following properties of the cosine function can be used to simplify Equation 7.7:

cos(π/8) = -cos(7π/8) = -cos(9π/8) = cos(15π/8)
and
cos(3π/8) = -cos(5π/8) = -cos(11π/8) = cos(13π/8)

These relationships are shown graphically in Figure 7.12, where cos(π/8), cos(7π/8), etc. are plotted as circles (o) and cos(3π/8), cos(5π/8), etc. are plotted as stars (*).

Using the above relationships, Equation 7.7 can be rewritten as follows:

F2 = 1/2 [ (f0 - f3 - f4 + f7) cos(π/8) + (f1 - f2 - f5 + f6) cos(3π/8) ]      (7.9)
Figure 7.12 Symmetries of the cosine function
The calculation for F2 has been reduced from eight multiplications and eight additions (Equation 7.7) to two multiplications and eight additions/subtractions (Equation 7.9). Applying a similar process to F6 gives:

F6 = 1/2 [ (f0 - f3 - f4 + f7) cos(3π/8) - (f1 - f2 - f5 + f6) cos(π/8) ]

The additions and subtractions are clearly the same as in Equation 7.9. We can therefore combine the calculations of F2 and F6 as follows:
b1 = f0 - f3 - f4 + f7
b2 = f1 - f2 - f5 + f6

2F2 = b1 cos(π/8) + b2 cos(3π/8)
2F6 = b1 cos(3π/8) - b2 cos(π/8)

In total, the two steps require 8 additions or subtractions and 4 multiplications, compared with 16 additions and 16 multiplications for the full calculation. The combined calculations of F2 and F6 can be graphically represented by a flowgraph as shown in Figure 7.13. In this figure, a circle represents addition and a square represents multiplication by a scaling factor. For clarity, the cosine scaling factors are represented as cX, meaning 'multiply by cos(Xπ/16)'. Hence, cos(π/8) is represented by c2 and cos(3π/8) is represented by c6.

Figure 7.13 Flowgraph for the combined calculation of F2 and F6 (circles: add; squares: multiply by a scaling factor cX = cos(Xπ/16); dashed lines: multiply by -1)
This approach can be extended to simplify the calculation of F0 and F4, producing the top half of the flowgraph shown in Figure 7.14. Applying basic cosine symmetries does not give
such a useful result for the odd-numbered FDCT coefficients (1, 3, 5, 7). However, further manipulation of the matrix of cosine weighting factors can simplify the calculation of these coefficients. Figure 7.14 shows a widely used example of a fast DCT algorithm, requiring only 26 additions/subtractions and 20 multiplications (in fact, this can be reduced to 16 multiplications by combining some of the multiplications by c4). This is considerably simpler than the 64 multiplies and 64 additions of the direct 1-D DCT. Each multiplication is by a constant scaling factor and these scaling factors may be pre-calculated to speed up computation.

In Figure 7.14, eight samples f0 ... f7 are input at the left and eight DCT coefficients 2F0 ... 2F7 are output at the right. A 1-D IDCT may be carried out by simply reversing the direction of the graph, i.e. the coefficients 2F0 ... 2F7 are now inputs and the samples f0 ... f7 are outputs.
By manipulating the transform operations in different ways, many other flowgraph algorithms can be obtained. Each algorithm has characteristics that may make it suitable for a particular application: for example, a minimal number of multiplications (for processing platforms where multiplications are particularly expensive), a minimal total number of operations (where multiplications are not computationally expensive), highly regular data flow, and so on. Table 7.2 summarises the features of some popular 1-D fast algorithms. Arai's algorithm requires only five multiplications, making it very efficient for most processing platforms; however, this algorithm results in incorrectly scaled coefficients and this must be compensated for by scaling the quantisation algorithm (see Section 7.6).
Figure 7.14 Complete FDCT flowgraph (from Chen, Fralick and Smith)

Table 7.2 Comparison of 1-D DCT fast algorithms

Source        Multiplications    Additions
'Direct'      64                 64
Chen3         16                 26
Lee4          12                 29
Loeffler5     11                 29
Arai6         5                  28
Fx = Σ (i = 0 to 7) Ci,x fi,   where Ci,x = 2 cos[(2i + 1)xπ/16]      (7.11)
The 1-D FDCT is the sum of eight products, where each product term is formed by multiplying an input sample by a constant weighting factor Ci,x. The first stage of Chen's fast algorithm shown in Figure 7.14 is a series of additions and subtractions and these can be used to simplify Equation 7.11. First, calculate four sums (u) and four differences (v) from the input data:

u0 = f0 + f7    u1 = f1 + f6    u2 = f2 + f5    u3 = f3 + f4
v0 = f0 - f7    v1 = f1 - f6    v2 = f2 - f5    v3 = f3 - f4      (7.12)
Equation 7.11 can be decomposed into two smaller calculations, one producing the even-numbered coefficients from the sums and one producing the odd-numbered coefficients from the differences:

Fx = Σ (i = 0 to 3) Ci,x ui    (x = 0, 2, 4, 6)      (7.13)

Fx = Σ (i = 0 to 3) Ci,x vi    (x = 1, 3, 5, 7)      (7.14)
In this form, the calculations are suitable for implementation using a technique known as distributed arithmetic (first proposed in 1974 for implementing digital filters). Each multiplication is carried out a bit at a time, using a look-up table and an accumulator (rather than a parallel multiplier).

A B-bit two's complement binary number n can be represented as:

n = -n_(B-1) 2^(B-1) + Σ (j = 0 to B-2) n_j 2^j      (7.15)
n_(B-1) is the most significant bit (MSB) of n (the sign bit) and n_j are the remaining (B - 1) bits of n. Assuming that each input u_i is a B-bit binary number in two's complement form, Equation 7.13 becomes:

Fx = Σ (i = 0 to 3) Ci,x [ -u_i,(B-1) 2^(B-1) + Σ (j = 0 to B-2) u_i,j 2^j ]      (7.16)
Rearranging the order of summation, so that the calculation proceeds one bit position at a time, gives (Equations 7.17 and 7.18):

Fx = -Dx(u^(B-1)) 2^(B-1) + Σ (j = 0 to B-2) Dx(u^j) 2^j,
where Dx(u^j) = Σ (i = 0 to 3) Ci,x u_i,j

Dx(u^j) is a function of the bits at position j in each of the four input values: these bits are u_0,j, u_1,j, u_2,j and u_3,j. This means that there are only 2^4 = 16 possible outcomes of Dx and these 16 outcomes may be pre-calculated and stored in a look-up table. The FDCT described by Equation 7.18 can be carried out by a series of table look-ups (Dx), additions and shifts. In this form, no multiplication is required and this maps efficiently to hardware (see Section 7.5.2). A similar approach is taken to calculate the four odd-numbered FDCT coefficients F1, F3, F5 and F7, and the distributed form may also be applied to the 1-D IDCT.
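A software model of this bit-serial process is sketched below in C (the word length B and the 64-bit accumulation are assumptions made for the sketch; a hardware ROM-Accumulator achieves the same result with one table look-up, one add/subtract and one shift per clock cycle):

#include <stdint.h>

#define B 12   /* assumed input word length in bits (two's complement) */

/* Build the 16-entry look-up table: lut[m] is the sum of the coefficients
   c[i] selected by the set bits of the 4-bit pattern m. */
void da_build_lut(const int32_t c[4], int32_t lut[16])
{
    for (int m = 0; m < 16; m++) {
        int32_t s = 0;
        for (int i = 0; i < 4; i++)
            if (m & (1 << i)) s += c[i];
        lut[m] = s;
    }
}

/* Distributed-arithmetic evaluation of c0*u0 + c1*u1 + c2*u2 + c3*u3.
   One bit position of all four inputs is processed per step: the four bits
   form the look-up address and the table output is accumulated with the
   appropriate binary weight (the MSB term is subtracted because the inputs
   are two's complement). The inputs must fit within B bits. */
int64_t da_sum_of_products(const int32_t u[4], const int32_t lut[16])
{
    int64_t acc = 0;
    for (int j = 0; j < B; j++) {
        int addr = 0;
        for (int i = 0; i < 4; i++)
            addr |= (int)(((uint32_t)u[i] >> j) & 1u) << i;
        int64_t d = lut[addr];
        acc += (j == B - 1 ? -d : d) * ((int64_t)1 << j);
    }
    return acc;
}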
draft standard (see Chapter 5), where an integer DCT approximation is defined as part of the standard to facilitate low-complexity implementations whilst retaining compliance with the standard.
7.5.1 Software DCT
Computational cost of multiplication. Some processors take many cycles to carry out a multiplication, others are reasonably fast. Alternative flowgraph-based algorithms allow the designer to trade the number of multiplications against the total number of operations.

Fixed vs. floating-point arithmetic capabilities. Poor floating-point performance may be compensated for by scaling the DCT multiplication factors to integer values.

Register availability. If the processor has a small number of internal registers then temporary variables should be kept to a minimum and reused where possible.

Availability of dedicated operations. Custom operations such as digital signal processor (DSP) multiply-accumulate operations and the Intel MMX instructions may be used to improve the performance of some DCT algorithms (see Chapter 12).
Example

Figure 7.15 lists pseudocode for Chen's algorithm (shown in Figure 7.14); only the top-half calculations are given, for clarity. The multiplication factors cX are pre-calculated constants. In this example, floating-point arithmetic is used: alternatively, the multipliers cX may be scaled up to integers and the entire DCT may be carried out using integer arithmetic (in which case, the final results must be scaled back down to compensate). The cosine multiplication factors never change and so these may be pre-calculated (in this case as floating-point numbers). A 1-D DCT is applied to each row in turn, then to each column. Note the use of a reasonably large number of temporary variables.

Further performance optimisation may be achieved by exploiting the flexibility of a software implementation. For example, variable-complexity algorithms (VCAs) may be applied to reduce the number of operations required to calculate the DCT and IDCT (see Chapter 10 for some examples).
constant c4 = 0.707107
constant c2 = 0.923880
constant c6 = 0.382683
// (similarly for c1, c3, c5 and c7)

for (every row) {
  i0 = f0 + f7           // First stage
  i1 = f1 + f6
  i2 = f2 + f5
  i3 = f3 + f4
  i4 = f3 - f4
  i5 = f2 - f5
  i6 = f1 - f6
  i7 = f0 - f7

  j0 = i0 + i3           // Second stage
  j1 = i1 + i2
  j2 = i1 - i2
  j3 = i0 - i3

  k0 = (j0 + j1) * c4    // Third stage
  k1 = (j0 - j1) * c4
  k2 = (j2 * c6) + (j3 * c2)
  k3 = (j3 * c6) - (j2 * c2)

  F0 = k0 >> 1           // >>1 : divide by 2
  F4 = k1 >> 1
  F2 = k2 >> 1
  F6 = k3 >> 1

  // (F1..F7 require another stage of multiplications and additions)
}
// (then repeat for every column)

Figure 7.15 Pseudocode for Chen's FDCT algorithm (top-half calculations only)
Figure 7.16 2-D DCT architecture using a 1-D DCT core and a transposition RAM

7.5.2 Hardware DCT
Dedicated hardware implementations of the FDCT/IDCT (suitable for ASIC or FPGA designs, for example) typically make use of separable 1-D transforms to calculate the 2-D transform. The two sets of row/column transforms shown in Figure 7.11 may be carried out using a single 1-D transform unit by transposing the 8 x 8 array between the two 1-D transforms, i.e.

Input data -> 1-D transform on rows -> Transpose array -> 1-D transform on columns -> Output data
An 8 x 8 RAM (transposition RAM) may be used to carry out the transposition. Figure 7.16 shows an architecture for the 2-D DCT that uses a 1-D transform core together with a transposition RAM. The following stages are required to calculate a complete 2-D FDCT (or IDCT):

1. Load input data in row order; calculate the 1-D DCT of each row; write into the transposition RAM in row order.

2. Read from the RAM in column order; calculate the 1-D DCT of each column; write into the RAM in column order.

3. Read the output data from the RAM in row order.
There are a number of options for implementing the 1-D FDCT or IDCT core. Flowgraph algorithms are not ideal for hardware designs: the data flow is not completely regular and it is not usually possible to efficiently reuse arithmetic units (such as adders and multipliers). Two popular and widely used designs are the parallel multiplier and distributed arithmetic approaches.
Figure 7.17 ROM-Accumulator circuit: 4 bits (one from each of the u inputs) address a coefficient ROM, and the ROM output is added to or subtracted from the shifted accumulator contents
Parallel multiplier

This is a more or less direct implementation of Equations 7.13 and 7.14 (four-point sum of products). After an initial stage that calculates the four sums (u) and differences (v) (see Equation 7.12), each sum-of-products result is calculated. There are 16 possible factors Ci,x for each of the two 4-point DCTs, and these factors may be pre-calculated to simplify the design. High performance may be achieved by carrying out the four multiplications for each result in parallel; however, this requires four large parallel multipliers and may be expensive in terms of logic gates.
Distributed arithmetic

The basic calculation of the distributed arithmetic algorithm is given by Equation 7.18. This calculation maps to the hardware circuit shown in Figure 7.17, known as a ROM-Accumulator circuit. Calculating each coefficient Fx takes a total of B clock cycles and proceeds as follows (Table 7.3). The accumulator is reset to zero at the start. During each
Table 7.3 Distributed arithmetic: calculation of one coefficient

Bit position   ROM input                        ROM output     Accumulator contents
B-1            bit (B-1) of u0, u1, u2 and u3   Dx(u^(B-1))    Dx(u^(B-1))
B-2            bit (B-2) of u0, u1, u2 and u3   Dx(u^(B-2))    Dx(u^(B-2)) + [Dx(u^(B-1)) >> 1]
...            ...                              ...            ...
1              bit 1 of u0, u1, u2 and u3       Dx(u^1)        Dx(u^1) + (previous contents >> 1)
0              bit 0 of u0, u1, u2 and u3       Dx(u^0)        -Dx(u^0) + (previous contents >> 1)
7.6 QUANTISATION
In a transform-based video CODEC, the transform stage is usually followed by a quantisation stage. The transforms described in this chapter (DCT and wavelet) are reversible, i.e. applying the transform followed by its inverse to image data results in the original image data. This means that the transform process does not remove any information; it simply represents the information in a different form. The quantisation stage removes less important information (i.e. information that does not have a significant influence on the appearance of the reconstructed image), making it possible to compress the remaining data.

In the main image and video coding standards described in Chapters 4 and 5, the quantisation process is split into two parts: an operation in the encoder that converts transform coefficients into levels (usually simply described as quantisation) and an operation in the decoder that converts levels into reconstructed transform coefficients (usually described as rescaling or inverse quantisation). The key to this process is that, whilst the original transform coefficients may take on a large number of possible values (like an analogue, continuous signal), the levels and hence the reconstructed coefficients are restricted to a discrete set of values. Figure 7.18 illustrates the quantisation process. Transform coefficients on a continuous scale are quantised to a limited number of possible levels. The levels are rescaled to produce reconstructed coefficients with approximately the same magnitude as the original coefficients but a limited number of possible values.
Figure 7.18 The quantisation process: original coefficients (which may take any value within a continuous range), quantised values and rescaled coefficients
A sparse matrix containing levels with a limited number of discrete values (the result of quantisation) can be efficiently compressed.

There is, of course, a detrimental effect on image quality because the reconstructed coefficients are not identical to the original set of coefficients, and hence the decoded image will not be identical to the original. The amount of compression and the loss of image quality depend on the number of levels produced by the quantiser. A large number of levels means that the coefficient precision is only slightly reduced and compression is low; a small number of levels means a significant reduction in coefficient precision (and image quality) but correspondingly high compression.
Example

The DCT coefficients from Table 7.1 are quantised and rescaled with (a) a fine quantiser (with the levels spaced at multiples of 4) and (b) a coarse quantiser (with the levels spaced at multiples of 16). The results are shown in Table 7.4. The finely quantised coefficients (a) retain most of the precision of the originals and 21 non-zero coefficients remain after quantisation. The coarsely quantised coefficients (b) have lost much of their precision and only seven coefficients are left after quantisation (the six coefficients illustrated in Figure 7.6 plus [7, 0]). The finely quantised block will produce a better approximation of the original image block after applying the IDCT; however, the coarsely quantised block will compress to a smaller number of bits.

Table 7.4 Quantised and rescaled coefficients
Figure 7.19 Linear quantiser: input values map to evenly spaced output levels (staircase characteristic)
Linear

The set of input values map to a set of evenly distributed output values and an example is illustrated in Figure 7.19. Plotting the mapping in this way produces a characteristic staircase. A linear quantiser is appropriate when it is required to retain the maximum precision across the entire range of possible input values.
Nonlinear

The set of output values are not linearly distributed; this means that input values are treated differently depending on their magnitude. A commonly used example is a quantiser with a dead zone about zero, as shown in Figure 7.20. A disproportionately wide range of low-valued inputs are mapped to a zero output. This has the effect of favouring larger values at the expense of smaller ones, i.e. small input values tend to be quantised to zero, whilst larger values are retained. This type of nonlinear quantiser may be used, for example, to quantise residual image data in an inter-coded frame. The residual DCT coefficients (after motion compensation and forward DCT) are distributed about zero. A typical coefficient matrix will contain a large number of near-zero values (positive and negative) and a small number of higher values, and a nonlinear quantiser will remove the near-zero values whilst retaining the high values.
Figure 7.20 Nonlinear quantiser with a dead zone about zero
Figure 7.21 shows the effect of applying two different nonlinear quantisers to a sine input. The figure shows the input together with the quantised and rescaled output; note the dead zone about zero. The left-hand graph shows a quantiser with 11 levels and the right-hand graph shows a coarser quantiser with only 5 levels.
|REC| = QUANT x (2 x |LEVEL| + 1)    (if LEVEL is non-zero)
REC = 0                              (if LEVEL = 0)      (7.19)
LEVEL is the decoded level prior to rescaling and REC is the rescaled coefficient. The sign of REC is the same as the sign of LEVEL. QUANT is a quantisation scale factor in the range 1-31. Table 7.5 gives some examples of reconstructed coefficients for a few of the possible combinations of LEVEL and QUANT. The QUANT parameter controls the step
Figure 7.21 Nonlinear quantisation of a sine wave: (a) low quantisation; (b) high quantisation
size of the reconstruction process: outside the dead zone (about zero), the reconstructed values are spaced at intervals of (QUANT x 2). A larger value of QUANT means more widely spaced reconstruction levels and this in turn gives higher compression (and poorer decoded image quality).
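A C sketch of this rescaling rule, paired with one possible forward quantiser (the forward operation is not defined by the standard, so the simple dead-zone division shown here is only one option), follows:

#include <stdlib.h>

/* Rescaling (inverse quantisation) as in Equation 7.19:
   |REC| = QUANT * (2*|LEVEL| + 1) for non-zero levels, REC = 0 otherwise.
   The sign of REC is the sign of LEVEL; QUANT is in the range 1-31. */
int rescale(int level, int quant)
{
    if (level == 0)
        return 0;
    int mag = quant * (2 * abs(level) + 1);
    return (level > 0) ? mag : -mag;
}

/* One possible (non-normative) forward quantiser with a dead zone:
   coefficients of magnitude less than 2*QUANT map to level 0. */
int quantise(int coef, int quant)
{
    int level = abs(coef) / (2 * quant);
    return (coef >= 0) ? level : -level;
}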
The other half of the process, the forward quantiser, is not defined by the standard. The design of the forward quantiser determines the range of coefficients (COEF) that map to each of the levels. There are many possibilities here: for example, one option is to design the
quantiser so that coefficients in the range 7 < COEF < 15 map to a reconstructed value of 11, and so on.

Table 7.5 Examples of reconstructed coefficients REC for combinations of LEVEL and QUANT
However, the quantiser shown in Figure 7.22 is not necessarily the best choice for inter-coded transform coefficients. Figure 7.23 shows the distribution of DCT coefficients in an MPEG-4 coded sequence: most of the coefficients are clustered about zero. Given a quantised coefficient, the original coefficient is more likely to have a low value than a high value. A better quantiser might bias the reconstructed coefficients towards zero; this means that, on average, the reconstructed values will be closer to the original values (for original coefficient values that are concentrated about zero). An example of a biased forward quantiser design is given in Appendix III of the H.263++ standard.

Figure 7.22 Example quantiser mapping from original coefficient values to quantised and rescaled values

Figure 7.23 Distribution of DCT coefficients in MPEG-4 P-pictures (reordered coefficients)
7.6.3 Quantiser Implementation
Forward quantisation maps an input coefficient to one of a set of possible levels, depending on the value of a parameter QUANT. As Equation 7.20 implies, this can usually be implemented as a division (or as a multiplication by an inverse parameter). The rescaling process (e.g. Equation 7.19) can be implemented as a multiplication.
7.6.4 Vector Quantisation
In the examples discussed above, each sample (e.g. a transform coefficient) was quantised independently of all other samples (scalar quantisation). In contrast, quantising a group of samples as a unit (vector quantisation) can offer more scope for efficient compression.21 In its basic form, vector quantisation is applied in the spatial domain (i.e. it does not involve a transform operation). The heart of a vector quantisation (VQ) CODEC is a codebook. This contains a predetermined set of vectors, where each vector is a block of samples or pixels. A VQ CODEC operates as follows:
Figure 7.25 Vector quantisation: the encoder finds the best matching vector in its codebook and transmits an index; the decoder uses the index to select the same vector from an identical codebook as the output block

1. Partition the image into blocks (e.g. 4 x 4 or 8 x 8 samples).
2. For each block, choose a vector from the codebook that matches the block as closely as
possible.
3. Transmit an index that identifies the chosen vector.
4. The decoder extracts the appropriate vector and uses this to represent the original image
block.
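A minimal C sketch of the encoder-side full search follows (a 4 x 4 block size and an SAE distortion measure are assumptions for the sketch):

#include <stdint.h>
#include <stdlib.h>

#define VDIM 16   /* 4x4 block = 16 samples per vector (assumed size) */

/* Full-search vector quantisation of one block: return the index of the
   codebook vector with the smallest sum of absolute errors. */
int vq_encode_block(const uint8_t block[VDIM],
                    const uint8_t codebook[][VDIM], int num_vectors)
{
    int best_index = 0;
    uint32_t best_sae = UINT32_MAX;

    for (int n = 0; n < num_vectors; n++) {
        uint32_t sae = 0;
        for (int k = 0; k < VDIM; k++)
            sae += (uint32_t)abs((int)block[k] - (int)codebook[n][k]);
        if (sae < best_sae) {
            best_sae = sae;
            best_index = n;   /* this index is transmitted to the decoder */
        }
    }
    return best_index;
}

The decoder simply copies the codebook entry selected by the transmitted index into the output block.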
Tree search VQ

In order to simplify the search procedure in a VQ encoder, the codebook is partitioned into a hierarchy. At each level of the hierarchy, the input image block is compared with just two
Figure 7.26 Tree-structured codebook
possible vectors and the best match is chosen. At the next level down, two further choices are offered (based on the choice at the previous level), and so on. Figure 7.26 shows the basic technique: the input block is first compared with two root vectors A and B (level 0). If A is chosen, the next comparison is with vectors C and D; if B is chosen, the next level chooses between E and F; and so on. In total, 2 log2(N) comparisons are required for a codebook of N vectors. This reduction in complexity is offset against a potential loss of image quality compared with a full search of the codebook, since the algorithm is not guaranteed to find the best match out of all possible vectors.
Practical considerations
Vector quantisation is highly asymmetrical in terms of computational complexity. Encoding
involves an intensive search operation for every image block, whilst decoding involves a
simple table look-up. VQ (in its basic form) is therefore unsuitable for many two-way video
7.7 SUMMARY
The most popular method of compressing images (or motion-compensated residual frames) is by applying a transform followed by quantisation. The purpose of an image transform is to decorrelate the original image data and to compact the energy of the image. After decorrelation and compaction, most of the image energy is concentrated into a small number of coefficients which are clustered together.

The DCT is usually applied to 8 x 8 blocks of image or residual data. The basic 2-D transform is relatively complex to implement, but the computation can be significantly reduced first by splitting it into two 1-D transforms and second by exploiting symmetry properties to simplify each 1-D transform. Flowgraph-type fast algorithms are suitable for software implementations and a range of algorithms enable the designer to tailor the choice of FDCT to the processing platform. The more regular parallel-multiplier or distributed arithmetic algorithms are better suited to dedicated hardware designs.

The design of the quantiser can make an important contribution to image quality in an image or video CODEC. After quantisation, the remaining significant transform coefficients are entropy encoded together with side information (such as headers and motion vectors) to form a compressed representation of the original image or video sequence. The next chapter will examine the theory and practice of designing efficient entropy encoders and decoders.
REFERENCES
1. N. Ahmed, T. Natarajan and K. R. Rao, Discrete cosine transform, IEEE Trans. Computers, January 1974.
2. W. A. Pearlman, Trends of tree-based, set-partitioning compression techniques in still and moving image systems, Proc. PCS01, Seoul, April 2001.
3. W-H. Chen, C. H. Smith and S. C. Fralick, A fast computational algorithm for the discrete cosine transform, IEEE Trans. Communications, COM-25(9), September 1977.
4. B. G. Lee, A new algorithm to compute the discrete cosine transform, IEEE Trans. ASSP, 32(6), December 1984.
5. C. Loeffler, A. Ligtenberg and G. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications, Proc. ICASSP-89, 1989.
6. Y. Arai, T. Agui and M. Nakajima, A fast DCT-SQ scheme for images, Trans. of the IEICE E, 71(11), November 1988.
7. M. Vetterli and H. Nussbaumer, Simple FFT and DCT algorithms with reduced number of operations, Signal Processing, 6(4), August 1984.
8. E. Feig and S. Winograd, Fast algorithms for the discrete cosine transform, IEEE Trans. Signal Processing, 40(9), September 1992.
9. F. A. Kamangar and K. R. Rao, Fast algorithms for the 2-D discrete cosine transform, IEEE Trans. Computers, 31(9), September 1982.
10. M. Vetterli, Fast 2-D discrete cosine transform, Proc. IEEE ICASSP, 1985.
11. A. Peled and B. Liu, A new hardware realization of digital filters, IEEE Trans. ASSP, 22(6), December 1974.
12. J. R. Spanier, G. Keane, J. Hunter and R. Woods, Low power implementation of a discrete cosine transform IP core, Proc. DATE-2000, Paris, March 2000.
13. T. D. Tran, The BinDCT: fast multiplierless approximation of the DCT, IEEE Signal Processing Letters, 7, June 2000.
14. M. Sanchez, J. Lopez, O. Plata, M. Trenas and E. Zapata, An efficient architecture for the in-place fast cosine transform, J. VLSI Sig. Proc., 21(2), June 1999.
15. G. Aggarwal and D. Gajski, Exploring DCT Implementations, UC Irvine Tech Report TR-98-10, March 1998.
16. G. A. Jullien, VLSI digital signal processing: some arithmetic issues, Proc. SPIE, 2846, Advanced Signal Processing Algorithms, Architectures and Implementations, October 1996.
17. M. T. Sun, T. C. Chen and A. Gottlieb, VLSI implementation of a 16 x 16 discrete cosine transform, IEEE Trans. Circuits and Systems, 36(4), April 1989.
18. T-S. Chang, C-S. Kung and C-W. Jen, A simple processor core design for DCT/IDCT, IEEE Trans. on CSVT, 10(3), April 2000.
19. P. Lee and G. Liu, An efficient algorithm for the 2-D discrete cosine transform, Signal Processing, 55(2), December 1996, pp. 221-239.
20. Y. Shoham and A. Gersho, Efficient bit allocation for an arbitrary set of quantisers, IEEE Trans. ASSP, 32(3), June 1984.
21. N. Nasrabadi and R. King, Image coding using vector quantisation: a review, IEEE Trans. Communications, 36(8), August 1988.
Entropy Coding
8.1 INTRODUCTION
A video encoder contains two main functions: a source model that attempts to represent a video scene in a compact form that is easy to compress (usually an approximation of the original video information) and an entropy encoder that compresses the output of the model prior to storage and transmission. The source model is matched to the characteristics of the input data (images or video frames), whereas the entropy coder may use general-purpose statistical compression techniques that are not necessarily unique in their application to image and video coding.
As with the functions described earlier (motion estimation and compensation, transform coding, quantisation), the design of an entropy CODEC is affected by a number of constraints including:

1. Compression efficiency: the aim is to represent the source model output using as few bits as possible.

3. Error robustness: if transmission errors are likely, the entropy CODEC should support recovery from errors and should (if possible) limit error propagation at the decoder (this constraint may conflict with (1) above).
In a typical transform-based video CODEC, the data to be encoded by the entropy CODEC falls into three main categories: transform coefficients (e.g. quantised DCT coefficients), motion vectors and side information (headers, synchronisation markers, etc.). The method of coding side information depends on the standard. Motion vectors can often be represented compactly in a differential form due to the high correlation between vectors for neighbouring blocks or macroblocks. Transform coefficients can be represented efficiently with run-level coding, exploiting the sparse nature of the DCT coefficient array.
An entropy encoder maps input symbols (for example, run-level coded coefficients) to a compressed data stream. It achieves compression by exploiting redundancy in the set of input symbols, representing frequently occurring symbols with a small number of bits and infrequently occurring symbols with a larger number of bits. The two most popular entropy encoding methods used in video coding standards are Huffman coding and arithmetic coding. Huffman coding (or modified Huffman coding) represents each input symbol by a variable-length codeword containing an integral number of bits. It is relatively
164
ENTROPY CODING
straightforwardtoimplement,butcannotachieveoptimalcompressionbecause
of the
restriction that each codeword must contain an integral number of bits. Arithmetic coding
maps aninputsymbolintoafractionalnumber
of bits,enablinggreatercompression
efficiency at the expense of higher complexity (depending on the implementation).
2. Run-level coding. This stage attempts to find a more efficient representation for the large number of zeros (48 in this case).

3. Entropy coding. The entropy encoder attempts to reduce the redundancy of the data symbols.
Reordering
The optimum method of reordering the quantised data depends on the distribution of the non-zero coefficients. If the original image (or motion-compensated residual) data is evenly distributed in the horizontal and vertical directions (i.e. there is not a predominance of strong image features in either direction), then the significant coefficients will also tend to be evenly distributed about the top left of the array (Figure 8.2(a)). In this case, a zigzag reordering pattern such as Figure 8.2(c) should group together the non-zero coefficients
Figure 8.2 Typical data distributions and reordering patterns: (a) even distribution; (b) field distribution; (c) zigzag; (d) modified zigzag
efficiently. However, in some cases an alternative pattern performs better. For example, a field of interlaced video tends to vary more rapidly in the vertical than in the horizontal direction (because it has been vertically subsampled). In this case the non-zero coefficients are likely to be 'skewed' as shown in Figure 8.2(b): they are clustered more to the left of the array (corresponding to basis functions with a strong vertical variation, see for example Figure 7.4). A modified reordering pattern such as Figure 8.2(d) should perform better at grouping the coefficients together.
Run-level coding

The output of the reordering process is a linear array of quantised coefficients. Non-zero coefficients are mainly grouped together near the start of the array and the remaining values in the array are zero. Long sequences of identical values (zeros in this case) can be represented as a (run, level) code, where 'run' indicates the number of zeros preceding a non-zero value and 'level' indicates the sign and magnitude of the non-zero coefficient.
The following example illustrates the reordering and run-level coding process.
Example
The block of coefficients in Figure 8.1 is reordered with the zigzag scan shown in Figure 8.2 and the reordered array is run-level coded.

Reordered array:
[102, -33, 21, -3, -2, -3, -4, -3, 0, 2, 1, 0, 1, 0, -2, -1, -1, 0, 0, 0, 0, -2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0 ...]

Run-level coded:
(0, 102) (0, -33) (0, 21) (0, -3) (0, -2) (0, -3) (0, -4) (0, -3) (1, 2) (0, 1) (1, 1) (1, -2) (0, -1) (0, -1) (4, -2) (11, 1)
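As a concrete illustration of these two steps, the following C sketch (illustrative only; the function names are not taken from any standard or reference implementation) scans an 8 x 8 block of quantised coefficients in zigzag order and prints (run, level) pairs followed by an end-of-block marker, i.e. the two-dimensional run-level form described later in this section.

#include <stdio.h>

/* Standard zigzag scan order for an 8 x 8 block (index = row*8 + column). */
static const int zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* Reorder one block with the zigzag scan and emit (run, level) pairs:
 * 'run' counts the zeros preceding each non-zero 'level'. */
void run_level_encode(const int block[64])
{
    int run = 0;
    for (int i = 0; i < 64; i++) {
        int level = block[zigzag[i]];
        if (level == 0) {
            run++;
        } else {
            printf("(%d, %d) ", run, level);
            run = 0;
        }
    }
    printf("EOB\n");   /* the final run of zeros is replaced by an end-of-block symbol */
}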
Two special cases need to be considered. Coefficient (0, 0) (the DC coefficient) is important to the appearance of the reconstructed image block and has no preceding zeros. In an intra-coded block (i.e. coded without motion compensation), the DC coefficient is rarely zero and so is treated differently from other coefficients. In an H.263 CODEC, intra-DC coefficients are encoded with a fixed, relatively low quantiser setting (to preserve image quality) and without (run, level) coding. Baseline JPEG takes advantage of the property that neighbouring image blocks tend to have similar mean values (and hence similar DC coefficient values) and each DC coefficient is encoded differentially from the previous DC coefficient.
The second special case is the final run of zeros in a block. Coefficient (7, 7) is usually zero and so we need a special case to deal with the final run of zeros that has no terminating non-zero value. In H.261 and baseline JPEG, a special code symbol, 'end of block' or EOB, is inserted after the last (run, level) pair. This approach is known as two-dimensional run-level coding since each code represents just two values (run and level). The method does not perform well under high compression: in this case, many blocks contain only a DC coefficient and so the EOB codes make up a significant proportion of the coded bit stream. H.263 and MPEG-4 avoid this problem by encoding a flag along with each (run, level) pair. This 'last' flag signifies the final (run, level) pair in the block and indicates to the decoder that the rest of the block should be filled with zeros. Each code now represents three values (run, level, last) and so this method is known as three-dimensional run-level-last coding.
Motion vectors
The vectordisplacementbetweenthecurrentandreferenceareas
(e.g. macroblocks)is
encoded along with each dataunit. Motion vectors for neighbouring data units are oftenvery
similar, and this property may be used to reduce the amount of information required to be
encoded.In anH.261 CODEC,forexample,themotionvectorforeachmacroblock
is
predicted from the preceding macroblock. The difference between the current and previous
A more
vectorisencodedandtransmitted(instead
of transmittingthevectoritself).
sophisticatedpredictionisformedduringMPEG-4/H.263coding:thevectorforeach
macroblock (or block if the optional advanced prediction mode is enabled) is predicted
from up to three previously transmitted motion vectors. This helps
to further reduce the
transmittedinformation. These two methods of predictingthecurrentmotionvector
are
shown in Figure 8.3.
Example

Motion vector of current macroblock:              x = +3.5,  y = +2.0
Predicted motion vector from previous macroblock: x = +3.0,  y = 0.0
Differential motion vector:                       dx = +0.5, dy = +2.0
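As a sketch of this kind of differential coding (not code from any standard; the struct and function names below are invented for illustration, and vectors are held as integers in half-pixel units so that 0.5-pixel steps remain whole numbers), the encoder transmits the difference from a prediction and the decoder adds it back:

typedef struct { int x; int y; } MV;   /* motion vector in half-pixel units */

/* Encoder: transmit the difference between the current vector and the
 * prediction (e.g. the previous macroblock's vector in H.261-style coding). */
MV mv_encode_differential(MV current, MV predicted)
{
    MV d = { current.x - predicted.x, current.y - predicted.y };
    return d;                /* dx and dy are then mapped to VLCs (see Table 8.6) */
}

/* Decoder: reconstruct the vector by adding the decoded difference
 * to the same prediction. */
MV mv_decode_differential(MV difference, MV predicted)
{
    MV v = { predicted.x + difference.x, predicted.y + difference.y };
    return v;
}

For the example above, the current vector (+3.5, +2.0) is held as (7, 4) in half-pixel units, the prediction (+3.0, 0.0) as (6, 0), and the transmitted difference is (1, 4), i.e. (+0.5, +2.0).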
Quantisation parameter

In order to maintain a target bit rate, it is common for a video encoder to modify the quantisation parameter (scale factor or step size) during encoding. The change must be signalled to the decoder. It is not usually desirable to suddenly change the quantisation parameter by a large amount during encoding of a video frame and so the parameter may be encoded differentially from the previous quantisation parameter.
Example

The coded block pattern (CBP) indicates the blocks containing non-zero coefficients in an inter-coded macroblock: one bit for each of the six blocks (Y0, Y1, Y2, Y3, Cr, Cb), set according to whether the block contains any non-zero coefficients (for example, CBP = 110100 or CBP = 011111).
Synchronisation markers

A video decoder may need to resynchronise in the event of an error or interruption to the stream of coded data. Synchronisation markers in the bit stream provide a means of doing this. Typically, the differential predictions mentioned above (DC coefficient, motion vectors and quantisation parameter) are reset after a synchronisation marker, so that the data after the marker may be decoded independently of previous (perhaps errored) data. Synchronisation is supported by restart markers in JPEG, group of block (GOB) headers in baseline H.263 and MPEG-4 (at fixed intervals within the coded picture) and slice start codes in MPEG-1, MPEG-2 and annexes to H.263 and MPEG-4 (at user-definable intervals).
Higher-level headers

Information that applies to a complete frame or picture is encoded in a header (picture header). Higher-level information about a sequence of frames may also be encoded (for example, sequence and group of pictures headers in MPEG-1 and MPEG-2).
8.3.1 'True' Huffman Coding

In order to achieve the maximum compression of a set of data symbols using Huffman encoding, it is necessary to calculate the probability of occurrence of each symbol. A set of variable-length codewords is then constructed for this data set. This process will be illustrated by the following example.
Table 8.1 Probability of occurrence of motion vectors (Carphone sequence)

Vector   Probability P   log2(1/P)
-1.5     0.014           6.16
-1       0.024           5.38
-0.5     0.117           3.10
0        0.646           0.63
0.5      0.101           3.31
1        0.027           5.21
1.5      0.016           5.97
Figure 8.4 Distribution of motion vector values (probability plotted against MVX or MVY)
Table 8.1 lists the motion vector values for the Carphone sequence and their information content, log2(1/P). To achieve optimum compression, each value should be represented with exactly log2(1/P) bits. The vector probabilities are shown graphically in Figure 8.4 (the solid line): 0 is the most common value and the probability drops sharply for larger motion vectors. (Note that there are a small number of vectors larger than +/-1.5 and so the probabilities in the table do not sum to 1.)
1. Generating the code tree

The data items are first sorted in increasing order of probability; the two items with the lowest probability are then combined to form a new 'node', and we assign the joint probability of the two items to this node. The procedure is repeated until there is a single 'root' node that contains all other nodes and data items listed beneath it. This procedure is illustrated in Figure 8.5.

Figure 8.5 Generating the Huffman code tree: Carphone motion vectors
Original list: The data items are shown as square boxes. Vectors (-1.5) and (1.5) have the lowest probability and these are the first candidates for merging to form node A.

Stage 1: The newly created node A (shown as a circle) has a probability of 0.03 (from the combined probabilities of (-1.5) and (1.5)) and the two lowest-probability items are now vectors (-1) and (1). These will be merged to form node B.

Stage 2: A and B are the next candidates for merging (to form C); C is then merged with vector (0.5) to form D, D with vector (-0.5) to form E, and finally E with vector (0) to form the root node F.

Final tree: The data items have all been incorporated into a binary tree containing seven data values and six nodes. Each data item is a 'leaf' of the tree.
2. Encoding

Each leaf of the binary tree is mapped to a VLC. To find this code, the tree is traversed from the root node (F in this case) to the leaf (data item). For every branch, a 0 or 1 is appended to the code: 0 for an upper branch, 1 for a lower branch (shown in the final tree of Figure 8.5). This gives the set of codes shown in Table 8.2. Encoding is achieved by transmitting the appropriate code for each data item. Note that once the tree has been generated, the codes may be stored in a look-up table.
Table 8.2 Huffman codes: Carphone motion vectors

Vector   Code     Bits (actual)   Bits (ideal)
0        1        1               0.63
-0.5     00       2               3.1
0.5      011      3               3.31
-1.5     01000    5               6.16
1.5      01001    5               5.97
-1       01010    5               5.38
1        01011    5               5.21
2. No code contains any other code as a prefix, i.e. reading from the left-hand bit, each code is uniquely decodable.

For example, the series of vectors (1, 0, 0.5) would be transmitted as the codes 01011, 1, 011, i.e. the bit sequence 010111011.
3. Decoding

In order to decode the data, the decoder must have a local copy of the Huffman code tree (or look-up table). This may be achieved by transmitting the look-up table itself, or by sending the list of data and probabilities prior to sending the coded data. Each uniquely decodable code may then be read and converted back to the original data. Following the example above:
01011 is decoded as (1)
1 is decoded as (0)
011 is decoded as (0.5)
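The tree-building and code-assignment steps above can be reproduced with a short program. The following C sketch is illustrative only (the data layout and names are assumptions, not taken from a codec implementation): it builds a Huffman tree for the Table 8.1 probabilities by repeatedly merging the two lowest-probability items and then prints a codeword for each vector. The result is a valid prefix code with the same code lengths as Table 8.2, although the exact bit patterns depend on how the two branches of each node are labelled.

#include <stdio.h>

#define NSYM 7

/* Motion vector values and probabilities from Table 8.1 (Carphone). */
static const char  *symbols[NSYM] = { "-1.5", "-1", "-0.5", "0", "0.5", "1", "1.5" };
static const double prob[NSYM]    = { 0.014, 0.024, 0.117, 0.646, 0.101, 0.027, 0.016 };

/* A node is either a leaf (symbol >= 0) or an internal node with two children. */
struct node { double p; int symbol; int left, right; };
static struct node tree[2 * NSYM];

/* Walk the tree from the root, appending '0' for one branch and '1' for the other. */
static void assign_codes(int n, char *code, int depth)
{
    if (tree[n].symbol >= 0) {               /* leaf: print the finished codeword */
        code[depth] = '\0';
        printf("%5s : %s\n", symbols[tree[n].symbol], code);
        return;
    }
    code[depth] = '0'; assign_codes(tree[n].left,  code, depth + 1);
    code[depth] = '1'; assign_codes(tree[n].right, code, depth + 1);
}

int main(void)
{
    int active[2 * NSYM], count = NSYM, total = NSYM;
    for (int i = 0; i < NSYM; i++) {
        tree[i].p = prob[i]; tree[i].symbol = i;
        active[i] = i;
    }
    /* Repeatedly merge the two lowest-probability items into a new node. */
    while (count > 1) {
        int a = 0, b = 1;
        if (tree[active[b]].p < tree[active[a]].p) { a = 1; b = 0; }
        for (int i = 2; i < count; i++) {
            if (tree[active[i]].p < tree[active[a]].p)      { b = a; a = i; }
            else if (tree[active[i]].p < tree[active[b]].p) { b = i; }
        }
        struct node merged = { tree[active[a]].p + tree[active[b]].p, -1,
                               active[a], active[b] };
        tree[total] = merged;
        /* Replace the two merged items with the new node in the active list. */
        int keep = (a < b) ? a : b, drop = (a < b) ? b : a;
        active[keep] = total++;
        active[drop] = active[--count];
    }
    char code[NSYM + 1];
    assign_codes(active[0], code, 0);
    return 0;
}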
Table 8.3 Probability of occurrence of motion vectors (Claire sequence): the zero vector dominates with P = 0.953 (information content 0.07 bits), and the remaining vectors (+/-0.5, +/-1, +/-1.5) have probabilities between 0.021 and 0.001 (information content between 5.57 and 9.66 bits).

Figure 8.6 Huffman tree for Claire motion vectors
The same procedure applied to these probabilities gives the Huffman tree shown in Figure 8.6 and the Huffman codes shown in Table 8.4. There are still six nodes in the tree, one less than the number of data items (seven): this is always the case with Huffman coding.

If the probability distributions are accurate, Huffman coding provides a relatively compact representation of the original data. In these examples, the frequently occurring (0) vector is represented very efficiently as a single bit. However, to achieve optimum compression, a
Table 8.4 Huffman codes: Claire motion vectors

Vector   Code      Bits (actual)   Bits (ideal)
0        1         1               0.07
0.5      00        2               5.57
-0.5     011       3               5.8
1        0100      4               8.38
-1       01011     5               8.38
-1.5     010100    6               9.66
1.5      010101    6               9.66
separate code table is required for each of the two sequences Carphone and Claire. The loss of potential compression efficiency due to the requirement for integral-length codes is very obvious for vector 0 in the Claire sequence: the optimum number of bits (information content) is 0.07 but the best that can be achieved with Huffman coding is 1 bit.
8.3.3 Table Design

The following two examples of VLC table design are taken from the H.263 and MPEG-4 standards. These tables are required for H.263 'baseline' coding and MPEG-4 'short video header' coding.
Table 8.5 H.263/MPEG-4 TCOEF variable length codes (partial list; each code is followed by a sign bit 's')

Last   Run   Level   Code
0      0     1       10s
0      1     1       110s
0      2     1       1110s
0      0     2       1111s
1      0     1       0111s
0      3     1       01101s
0      4     1       01100s
0      5     1       01011s
0      0     3       010101s
0      1     2       010100s
0      6     1       010011s
0      7     1       010010s
0      8     1       010001s
0      9     1       010000s
1      1     1       001111s
1      2     1       001110s
1      3     1       001101s
1      4     1       001100s
0      0     4       0010111s
0      10    1       0010110s
0      11    1       0010101s
0      12    1       0010100s
1      5     1       0010011s
1      6     1       0010010s
1      7     1       0010001s
1      8     1       0010000s
ESCAPE               0000011s
...    ...   ...     ...
Figure 8.7 H.263/MPEG-4 TCOEF VLCs (code tree representation of Table 8.5, including the escape and error codes)
Table 8.6 H.263/MPEG-4 motion vector difference (MVD) VLCs (partial list)

MVD      Code
0        1
+0.5     010
-0.5     011
+1       0010
-1       0011
+1.5     00010
-1.5     00011
+2       0000110
-2       0000111
+2.5     00001010
-2.5     00001011
+3       00001000
-3       00001001
+3.5     00000110
-3.5     00000111
...      ...
A 'universal' VLC (UVLC), adopted in the emerging H.26L design, uses a single regular set of codewords for all data elements. Each codeword has the structure 0 xn-1 0 xn-2 ... 0 x1 0 x0 1, where each xk is a single bit. Hence there is one 1-bit codeword; two 3-bit codewords; four 5-bit codewords; eight 7-bit codewords; and so on. Table 8.7 shows the first 12 codes and these are represented in tree form in Figure 8.9. The highly regular structure of the set of codewords can be seen in this figure.

Any data element to be coded (transform coefficients, motion vectors, block patterns, etc.) is assigned a code from the list of UVLCs. The codes are not optimised for a specific data element (since the same set of codes is used for all elements); however, the uniform, regular structure considerably simplifies encoder and decoder design since the same methods can be used to encode or decode any data element.

Table 8.7 The first 12 universal VLCs

Code number   x2    x1    x0    Codeword
0             N/A   N/A   N/A   1
1             N/A   N/A   0     001
2             N/A   N/A   1     011
3             N/A   0     0     00001
4             N/A   0     1     00011
5             N/A   1     0     01001
6             N/A   1     1     01011
7             0     0     0     0000001
8             0     0     1     0000011
9             0     1     0     0001001
10            0     1     1     0001011
11            1     0     0     0100001
...           ...   ...   ...   ...

Figure 8.9 Universal VLCs (tree representation of the codes in Table 8.7)
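A sketch of how such codewords can be generated follows (C, illustrative only; the function name and the index-to-codeword mapping are assumptions consistent with Table 8.7: code number k is converted to the binary value k + 1, and the bits below its leading 1 become xn-1 ... x0, each preceded by a '0' marker and followed by a terminating '1').

#include <stdio.h>

/* Print the UVLC codeword for a given code number (0, 1, 2, ...). */
void uvlc_print(unsigned index)
{
    unsigned value = index + 1;           /* 1, 10, 11, 100, ... in binary */
    int nbits = 0;
    while ((value >> (nbits + 1)) != 0)   /* count the bits below the leading 1 */
        nbits++;
    for (int i = nbits - 1; i >= 0; i--)
        printf("0%u", (value >> i) & 1);  /* '0' marker followed by info bit x_i */
    printf("1\n");                        /* terminating '1' */
}

int main(void)
{
    for (unsigned i = 0; i < 12; i++) {   /* reproduces the codewords of Table 8.7 */
        printf("code number %2u : ", i);
        uvlc_print(i);
    }
    return 0;
}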
8.3.4 Entropy Coding Example

This example follows the process of encoding and decoding a block of quantised coefficients in an MPEG-4 inter-coded picture. Only six non-zero coefficients remain in the block: this would be characteristic of either a highly compressed block or a block that has been efficiently predicted by motion estimation.
Quantised DCT coefficients after zigzag reordering (all remaining values in the array are zero):

[4, -1, 0, 2, -3, 0, 0, 0, 0, 0, -1, 0, 0, 0, 1, 0, ...]

(last, run, level) symbols, with signed levels:

(0, 0, +4) (0, 0, -1) (0, 1, +2) (0, 0, -3) (0, 5, -1) (1, 3, +1)
TCOEF variable length codes (from Table 8.5; note that the last bit of each code is the sign):

00101110; 101; 0101000; 0101011; 010111; 0011010

Transmitted bit sequence:

00101110101010100001010110101110011010
Decoding of this sequence proceeds as follows. The decoder steps through the TCOEF tree (shown in Figure 8.7) until it reaches the leaf 0010111. The next bit (0) is decoded as the sign and the (last, run, level) group (0, 0, 4) is obtained. The steps taken by the decoder for this first coefficient are highlighted in Figure 8.10. The process is repeated with the leaf 10 followed by sign (1) and so on until a 'last' coefficient is decoded. The decoder can now fill the coefficient array and reverse the zigzag scan to restore the array of 8 x 8 quantised coefficients.

Figure 8.10 Decoding the first coefficient: the path through the TCOEF tree to the leaf 0010111
8.3.5 Variable Length Encoder Design

Software design

A general approach to variable-length encoding in software is as follows:
for each data symbol
    find the corresponding VLC value and length (in bits)
    pack this VLC into an output register R
    if the contents of R exceed L bytes
        write L (least significant) bytes to the output stream
        shift R by L bytes
Example

Using the entropy encoding example above, with L = 1 byte and R empty at the start of encoding, the following packed bytes are written to the output stream: 00101110, 01000101, 10101101, 00101111. At the end of the above sequence, the output register R still contains 6 bits (001101). If encoding stops here, it will be necessary to flush the contents of R to the output stream.
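In C, the packing loop might be implemented along the following lines. This is an illustrative sketch only: the structure and function names are invented, codes are appended most-significant-bit first into a 32-bit accumulator (note that the worked example above writes its bytes in a different, least-significant-byte-first order), and codes are assumed to be no longer than 24 bits.

#include <stdint.h>
#include <stdio.h>

/* Output register for packing variable-length codes into bytes. */
typedef struct {
    uint32_t reg;     /* bit accumulator                                  */
    int      nbits;   /* number of valid bits currently in the accumulator */
    FILE    *out;
} BitPacker;

void put_vlc(BitPacker *bp, uint32_t vlc, int length)
{
    bp->reg = (bp->reg << length) | (vlc & ((1u << length) - 1));
    bp->nbits += length;
    while (bp->nbits >= 8) {                       /* a whole byte is ready */
        uint8_t byte = (uint8_t)(bp->reg >> (bp->nbits - 8));
        fputc(byte, bp->out);
        bp->nbits -= 8;                            /* remaining bits stay in the register */
    }
}

/* Flush any remaining bits, padding the final byte with zeros. */
void flush_vlc(BitPacker *bp)
{
    if (bp->nbits > 0) {
        uint8_t byte = (uint8_t)(bp->reg << (8 - bp->nbits));
        fputc(byte, bp->out);
        bp->nbits = 0;
    }
}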
The MVD codes listed in Table 8.6 can be stored in a simple look-up table. Only 64 valid MVD values exist and the contents of the look-up table are as follows:

[index] [vlc] [length]

where [index] is a number in the range 0...63 that is derived directly from MVD, [vlc] is the variable length code padded with zeros and represented with a fixed number of bits (e.g. 16 or 32 bits) and [length] indicates the number of bits present in the variable length code.
Converting (last, run, level) into the TCOEF VLCs listed in Table 8.5 is slightly more problematic. The 102 predetermined combinations of (last, run, level) have individual VLCs assigned to them (these are the most commonly occurring combinations) and any other combination must be converted to an Escape sequence. The problem is that there are many more possible combinations of (last, run, level) than there are individual VLCs. Run may take any value between 0 and 62; Level any value between 1 and 127; and Last is 0 or 1. This gives 16 002 possible combinations of (last, run, level). Three possible approaches to finding the VLC are as follows:

1. Large look-up table indexed by (last, run, level). The size of this table may be reduced somewhat because only levels in the range 1-12 and runs in the range 0-40 have individual VLCs. The look-up procedure is as follows:
i f ( \ l e v e l ] < 1 3 a n d r u n3<9 )
lookuptablebasedon (last, run, level)
returnindividualVLCor calculateEscape sequence
else
calculate Escape sequence
The look-up table has ( 2 x 4 0 12)
~ = 960 entries; 102 of these contain individual VLCs
and the remaining 858 contain a flag indicating that an Escape sequence is required.
2. Partitioned look-up tables indexed by (last, run, level). Based on the values of last, run and level, choose a smaller look-up table (e.g. a table that only applies when last = 0). This requires one or more comparisons before choosing the table but allows the large table to be split into a number of smaller tables with fewer entries overall. The procedure is as follows:
if (last, run, level) is in {set A}
    look up table A
    return VLC or calculate Escape sequence
else if (last, run, level) is in {set B}
    look up table B
    return VLC or calculate Escape sequence
...
else
    calculate Escape sequence
For example, earlier versions of the H.263 test model software used this approach to reduce the number of entries in the partitioned look-up tables to 200 (i.e. 102 valid VLCs and 98 empty entries).
3. Conditional expression for every valid combination of (last, run, level). For example:

switch (last, run, level)
    case {A}: vlc = vA, length = lA
    case {B}: vlc = vB, length = lB
    ... (100 more cases) ...
    default: calculate Escape sequence
Comparing the three methods, method 1 lends itself to compact code, is easy to modify (by changing the look-up table contents) and is likely to be computationally efficient; however, it requires a large look-up table, most of which is redundant. Method 3, at the other extreme, requires the most code and is the most difficult to change (since each valid combination is hand-coded) but requires the least data storage. On some platforms it may be the slowest method. Method 2 offers a compromise between the other two methods.
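A minimal sketch of method 1 in C is shown below (illustrative only: the type and function names are assumptions, and the table bounds follow the 2 x 40 x 12 sizing discussed above rather than any particular standard). Entries with a zero length field play the role of the 'Escape required' flag.

#include <stdint.h>

/* One entry of the (last, run, level) look-up table: a VLC and its length in
 * bits, or length == 0 meaning "no individual code: use an Escape sequence". */
typedef struct { uint16_t vlc; uint8_t length; } VlcEntry;

#define MAX_RUN   40      /* runs 0..39 indexed in this sketch   */
#define MAX_LEVEL 12      /* levels 1..12 indexed in this sketch */

/* Table indexed by [last][run][level-1]; 2 x 40 x 12 = 960 entries, of which
 * only the valid combinations are filled in at start-up (not shown here). */
static VlcEntry tcoef_table[2][MAX_RUN][MAX_LEVEL];

/* Look up the code for one (last, run, level) symbol (level excludes the sign).
 * Returns 1 if an individual VLC exists, 0 if an Escape sequence is needed. */
int tcoef_lookup(int last, int run, int level, uint32_t *vlc, int *length)
{
    if (level >= 1 && level <= MAX_LEVEL && run >= 0 && run < MAX_RUN) {
        VlcEntry e = tcoef_table[last][run][level - 1];
        if (e.length != 0) {
            *vlc = e.vlc;
            *length = e.length;
            return 1;                 /* individual VLC found */
        }
    }
    return 0;                         /* caller must build an Escape sequence */
}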
Hardware design

A hardware architecture for variable length encoding performs similar tasks to those described above and an example is shown in Figure 8.11 (based on a design proposed by Lei and Sun2). A look-up unit finds the length and value of the appropriate VLC and passes these to a pack unit. The pack unit collects together a fixed number of bits (e.g. 8, 16 or 32 bits) and shifts these out to a stream buffer. Within the pack unit, a counter records the number of bits in the output register. When this counter overflows, a data word is output (as in the example above) and the remaining upper bits in the output register are shifted down.

The design of the look-up unit is critical to the size, efficiency and adaptability of the design. Options range from a ROM or RAM-based look-up table containing all valid codes plus dummy entries indicating that an Escape sequence is required, to a hard-wired approach (similar to the switch statement described above) in which each valid combination is mapped to the appropriate VLC and length fields. This approach is sometimes described as a programmable logic array (PLA) look-up table. Another example of a hardware VLE is presented elsewhere.3
Figure 8.11 Hardware VLE: a look-up unit (table select / calculate VLC) feeds a pack unit that produces a byte or word stream
8.3.6 Variable Length Decoder Design

Software design

The operation of a decoder for VLCs can be summarised as follows:
if (first bit = 1)
    if (second bit = 1)
        if (third bit = 1)
            if (fourth bit = 1)
                return (0,0,2)
            else
                return (0,2,1)
        else
            return (0,1,1)
    else
        return (0,0,1)
else
    ... decode all VLCs starting with 0
This approach requires a large nested if...else statement (or equivalent) that can deal with 104 cases (102 unique TCOEF VLCs, one escape code, plus an error condition). This method leads to a large code size, may be slow to execute and is difficult to modify (because the Huffman tree is hand-coded into the software); however, no extra look-up tables are required.

An alternative is to use one or more look-up tables. The maximum length of a TCOEF VLC (excluding the sign bit and escape sequences) is 13 bits. We can construct a look-up table whose index is a 13-bit number (the 13 lsbs of the input stream). Each entry of the table contains either a (last, run, level) triplet or a flag indicating Escape or Error; 2^13 = 8192 entries are required, most of which will be duplicates of other entries. For example, every code beginning with 10... (starting with the lsb) decodes to the triplet (0, 0, 1).

An initial test of the range of the 13-bit number may be used to select one of a number of smaller look-up tables. For example, the H.263 reference model decoder described earlier breaks the table into three smaller tables containing around 300 entries (about 200 of which are duplicate entries).
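A sketch of the single large look-up table approach in C follows (illustrative only). The peek_bits and advance_bits functions are assumed helpers for reading the bit stream rather than part of any standard API, and the construction of the 8192-entry table from the code set is not shown.

#include <stdint.h>

/* Decoded symbol: either a (last, run, level) triplet, an Escape code or an error. */
typedef struct {
    uint8_t kind;        /* 0 = triplet, 1 = escape, 2 = error        */
    uint8_t last, run;   /* valid when kind == 0                      */
    int8_t  level;       /* magnitude only; the sign bit follows      */
    uint8_t length;      /* number of bits consumed by this codeword  */
} TcoefSymbol;

/* 2^13 entries, one for every possible 13-bit prefix of the bit stream.
 * Every prefix that starts with the same codeword holds a duplicate entry. */
static TcoefSymbol decode_table[1 << 13];

/* Assumed bit-stream reader helpers (implementations not shown). */
extern uint32_t peek_bits(int n);
extern void     advance_bits(int n);

/* Decode one symbol: peek 13 bits, index the table, then advance the stream
 * by only the number of bits actually used by the matching codeword. */
TcoefSymbol decode_tcoef(void)
{
    uint32_t index = peek_bits(13);
    TcoefSymbol s = decode_table[index];
    advance_bits(s.length);
    return s;
}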
Figure 8.12 Variable length decoder: the bit stream enters an input shift register, a 'Find VL code' unit examines one or more bits at a time and outputs the decoded data unit, and the code length L is fed back to the shift register
Hardware design

Hardware designs for variable length decoding fall into two categories: (a) those that decode n bits from the input stream every m cycles (e.g. decoding 1 or 2 bits per cycle) and (b) those that decode n complete VL codewords every m cycles (e.g. decoding 1 codeword in one or two cycles). The basic architecture of a decoder is shown in Figure 8.12 (the dotted line, code length L, is only required for category (b) decoders).

Category (a), n bits per m cycles. This type of decoder follows through the Huffman decoding tree. The simplest design processes one level of the tree every cycle: this is analogous to the large if...else statement described above. The shift register shown in Figure 8.12 shifts 1 bit per cycle to the Find VL code unit. This unit steps through the tree (based on the value of each input bit) until a valid code (a leaf) is found, and can be implemented with a finite state machine (FSM) architecture. For example, Table 8.8 lists part of the FSM for the TCOEF tree shown in Figure 8.7. Each state corresponds to a node of the Huffman tree and the nodes in the table are labelled (with circles) in Figure 8.13 for convenience.

There are 102 nodes (and hence 102 states in the FSM) and 103 output values. To decode 1110, for example, the decoder traces the following sequence:

State 0 -> State 2 -> State 5 -> State 6 -> output (0, 2, 1)

Hence the decoder processes 1 bit per cycle (assuming that a state transition occurs per clock cycle).
Table 8.8 Part of the finite state machine for the TCOEF decoding tree (current state, input bit, next state or decoded output)
This type of decoder has the disadvantage that the processing rate depends on the (variable) rate of the coded stream. It is often more useful to be capable of processing one or more complete VLCs per clock cycle (for example, to guarantee a certain codeword throughput), and this leads to the second category of decoder design.
Category (b), n codewords per m cycles. This is analogous to the large look-up table approach in a software decoder. K bits (stored in the input shift register) are examined per cycle, where K is the largest possible VLC size (13, excluding the sign bit, in the example of H.263/MPEG-4 TCOEF). The Find VL code unit in Figure 8.12 checks all combinations of K bits and finds a matching valid code, Escape code or flags an error. The length of the matching code (L bits) is fed back and the shift register shifts the input data by L bits (i.e. L bits are removed from the input buffer). Hence a complete L-bit codeword can be processed in one cycle.

The shift register can be implemented using a barrel shifter (a shift-register circuit that shifts its contents by L places in one cycle). The Find VL code unit may be implemented using logic (a PLA). The logic array should minimise effectively since most of the possible input combinations are 'don't cares': in the TCOEF example, all 13-bit input words 10XXXXXXXXXXX map to the output (0, 0, 1). It is also possible to implement this unit as a ROM or RAM look-up table with 2^13 entries.

A decoder that decodes one codeword per cycle is described by Lei and Sun2, and Chang and Messerschmitt4 examine the principles of concurrent VLC decoding. Further examples of VL decoders can be found elsewhere.5,6
Figure 8.13 Part of the TCOEF decoding tree with nodes (states) labelled
Decoding errors may continue to occur (propagate) until a resynchronisation point occurs in the bit stream. The synchronisation markers described in Section 8.2.2 limit the propagation of errors at the decoder. Increasing the frequency of synchronisation markers in the bit stream can reduce the effect of an error on the decoded image; however, markers are redundant overhead and so this also reduces compression efficiency. Transmission errors and their effect on coded video are discussed further in Chapter 11.
Error-resilient alternatives to modified Huffman codes have been proposed. For example, MPEG-4 (video) includes an option to use reversible variable length codes (RVLCs), a class of codewords that may be successfully decoded in either a forward or backward direction from a resynchronisation point. When an error occurs, it is usually detectable by the decoder (since a serious decoder error is likely to violate the encoding syntax). The decoder can decode the current section of data in both directions, forward from the previous synchronisation point and backward from the next synchronisation point. Figure 8.14 shows an example. Region (a) is decoded and then an error is identified. The decoder skips to the next synchronisation marker and decodes backwards from that point (region (b)).
Figure 8.14 Decoding with reversible VLCs: region (a) is decoded forwards until an error is detected; region (b) is decoded backwards from the next synchronisation marker
Example

Table 8.9 lists five motion vector values (-2, -1, 0, 1, 2). The probability of occurrence of each vector is listed in the second column. Each vector is assigned a subrange within the range 0-1.0, depending on its probability of occurrence.
Table 8.9 Subranges

Vector   Probability   log2(1/P)   Subrange
-2       0.1           3.32        0 - 0.1
-1       0.2           2.32        0.1 - 0.3
0        0.4           1.32        0.3 - 0.7
1        0.2           2.32        0.7 - 0.9
2        0.1           3.32        0.9 - 1.0
Encoding procedure

Symbol   Subrange (L -> H)   New range (L -> H)   Notes
-        -                   0 -> 1.0             Total range
(0)      0.3 -> 0.7          0.3 -> 0.7
(-1)     0.1 -> 0.3          0.34 -> 0.42
(0)      0.3 -> 0.7          0.364 -> 0.396       0.364 is 30% of the range; 0.396 is 70% of the range
(2)      0.9 -> 1.0          0.3928 -> 0.396
The encoder transmits a number that lies within the final range 0.3928-0.396: for example, 0.394. Figure 8.16 shows how the initial range (0-1) is progressively partitioned into smaller ranges as each data symbol is processed. After encoding the first symbol (vector 0), the new range is (0.3, 0.7). The next symbol (vector -1) selects the subrange (0.34, 0.42) which becomes the new range, and so on. The final symbol (vector +2) selects the subrange (0.3928, 0.396) and the number 0.394 (falling within this range) is transmitted. 0.394 can be represented as a fixed-point fractional number using 9 bits, i.e. our data sequence (0, -1, 0, 2) is compressed to a 9-bit quantity.
Decoding procedure
The sequence of subranges (and hence the sequence of data symbols) can be decoded from
this number as follows.
Decoding procedure

Range           Subrange containing 0.394   Decoded symbol   New range
0 -> 1.0        0.3 -> 0.7                  (0)              0.3 -> 0.7
0.3 -> 0.7      0.1 -> 0.3                  (-1)             0.34 -> 0.42
0.34 -> 0.42    0.3 -> 0.7                  (0)              0.364 -> 0.396
0.364 -> 0.396  0.9 -> 1.0                  (2)
The principal advantage of arithmetic coding is that the transmitted number (0.394 in this case, which can be represented as a fixed-point number with sufficient accuracy using 9 bits) is not constrained to an integral number of bits for each transmitted data symbol. To achieve optimal compression, the sequence of data symbols should be represented with log2(1/P) bits per symbol, i.e. a total of 1.32 + 2.32 + 1.32 + 3.32 = 8.28 bits in this example.
Figure 8.16 Arithmetic coding example: the range (0-1) is progressively partitioned as the symbols (0), (-1), (0) and (2) are encoded
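The encoding and decoding procedures above can be followed step by step in code. The C sketch below is illustrative only: it uses floating-point arithmetic for clarity (a practical implementation would use the fixed-point, incremental techniques discussed below) and the names are invented. It applies the Table 8.9 model, reproduces the ranges of the worked example and recovers the sequence (0, -1, 0, 2).

#include <stdio.h>

#define NSYM 5

/* Model from Table 8.9: motion vector values and their subranges. */
static const int    value[NSYM] = { -2, -1, 0, 1, 2 };
static const double low[NSYM]   = { 0.0, 0.1, 0.3, 0.7, 0.9 };
static const double high[NSYM]  = { 0.1, 0.3, 0.7, 0.9, 1.0 };

/* Encode: repeatedly narrow the range [L, H) according to each symbol's subrange. */
double arith_encode(const int *seq, int n)
{
    double L = 0.0, H = 1.0;
    for (int i = 0; i < n; i++) {
        int s;
        for (s = 0; s < NSYM && value[s] != seq[i]; s++)
            ;
        double range = H - L;
        H = L + range * high[s];
        L = L + range * low[s];
        printf("symbol %2d  new range %.4f - %.4f\n", seq[i], L, H);
    }
    return (L + H) / 2.0;            /* any number inside the final range will do */
}

/* Decode: find which subrange the number falls into, output that symbol,
 * then rescale the number into that subrange and repeat. */
void arith_decode(double x, int n)
{
    for (int i = 0; i < n; i++) {
        int s = 0;
        while (s < NSYM - 1 && x >= high[s])
            s++;
        printf("decoded %d\n", value[s]);
        x = (x - low[s]) / (high[s] - low[s]);   /* renormalise for the next symbol */
    }
}

int main(void)
{
    int seq[4] = { 0, -1, 0, 2 };
    double x = arith_encode(seq, 4);
    printf("transmitted number: %.4f\n", x);
    arith_decode(x, 4);
    return 0;
}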
Probability distributions

As with Huffman coding, it is not always practical to calculate symbol probabilities prior to coding. In several video coding standards (e.g. H.263, MPEG-4, H.26L), arithmetic coding is provided as an optional alternative to Huffman coding and pre-calculated subranges are defined by the standard (based on 'typical' probability distributions). This has the advantage of avoiding the need to calculate and transmit probability distributions, but the disadvantage that compression will be suboptimal for a video sequence that does not exactly follow the standard probability distributions.
Termination

In our example, we stopped decoding after four steps. However, there is nothing contained in the transmitted number (0.394) to indicate the number of symbols that must be decoded: it could equally be decoded as three symbols or five. The decoder must determine when to stop decoding by some other means. In the arithmetic coding option specified in H.263, for example, the decoder can determine the number of symbols to decode according to the syntax of the coded data. Decoding of transform coefficients in a block halts when an end-of-block code is detected. Fixed-length codes (such as picture start code) are included in the bit stream and these will force the decoder to stop decoding (for example, if a transmission error has occurred).
Fixed-point arithmetic

Floating-point binary arithmetic is generally less efficient than fixed-point arithmetic and some processors do not support floating-point arithmetic at all. An efficient implementation with fixed-point arithmetic can be achieved by specifying the subranges as fixed-precision binary numbers. For example, in H.263, each subrange is specified as an unsigned 14-bit integer (i.e. a total range of 0-16383). As an example, the four values of the differential quantisation parameter DQUANT are assigned the subranges 0-4094, 4095-8191, 8192-12286 and 12287-16383.
Incremental encoding

As more data symbols are encoded, the precision of the fractional number required to represent the sequence increases. It is possible for the number to exceed the precision of the processor after a relatively small number of data symbols and a practical arithmetic encoder must take steps to ensure that this does not occur. This can be achieved by incrementally encoding bits of the fractional number as they are identified by the encoder. In our example above, after step 3, the range is 0.364-0.396. We know that the final fractional number will begin with 0.3... and so we can send the most significant part (e.g. 0.3, or its binary equivalent) without prejudicing the remaining calculations. At the same time, the limits of the range are left-shifted to extend the range. In this way, the encoder incrementally sends the most significant bits of the fractional number whilst continually readjusting the boundaries of the range to avoid arithmetic overflow.
Patent issues

A number of patents have been filed that cover aspects of arithmetic encoding (such as IBM's 'Q-coder' arithmetic coding algorithm). It is not entirely clear whether the arithmetic coding algorithms specified in the image and video coding standards are covered by patents. Some developers of commercial video coding systems have avoided the use of arithmetic coding because of concerns about potential patent infringements, despite its potential compression advantages.
8.5 SUMMARY

An entropy coder maps a sequence of data elements to a compressed bit stream, removing statistical redundancy in the process. In a block transform-based video CODEC, the main data elements are transform coefficients (run-level coded to efficiently represent sequences of zero coefficients), motion vectors (which may be differentially coded) and header information. Optimum compression requires the probability distributions of the data to be analysed prior to coding; for practical reasons, video CODECs use standard pre-calculated look-up tables for entropy coding.
The two most popular entropy coding methods for video CODECs are modified Huffman coding (in which each element is mapped to a separate VLC) and arithmetic coding (in which a series of elements are coded to form a fractional number). Huffman encoding may be carried out using a series of table look-up operations; a Huffman decoder can identify each VLC unambiguously because the codes are designed such that no code forms the prefix of any other. Arithmetic coding is carried out by generating and encoding a fractional number to represent a series of data elements.
This concludes the discussion of the main internal functions of a video CODEC (motion
estimation and compensation, transform coding and entropy coding). The performance of a
CODEC in a practical video communication system can often be dramatically improved by
filtering the source video (pre-filtering) and/or the decoded video frames (post-filtering).
REFERENCES
1. D. A. Huffman, A method for the construction of minimum-redundancy codes, Proceedings of the Institute of Electrical and Radio Engineers, 40(9), September 1952.
2. S. M. Lei and M-T. Sun, An entropy coding system for digital HDTV applications, IEEE Trans. CSVT, 1(1), March 1991.
3. Hao-Chieh Chang, Liang-Gee Chen, Yung-Chi Chang and Sheng-Chieh Huang, A VLSI architecture design of VLC encoder for high data rate video/image coding, Proc. 1999 IEEE International Symposium on Circuits and Systems (ISCAS 99).
4. S. F. Chang and D. Messerschmitt, Designing high-throughput VLC decoder, Part I - concurrent VLSI architectures, IEEE Trans. CSVT, 2(2), June 1992.
5. J. Jeon, S. Park and H. Park, A fast variable-length decoder using plane separation, IEEE Trans. CSVT, 10(5), August 2000.
6. B-J. Shieh, Y-S. Lee and C-Y. Lee, A high throughput memory-based VLC decoder with codeword boundary prediction, IEEE Trans. CSVT, 10(8), December 2000.
7. A. Kopansky and M. Bystrom, Sequential decoding of MPEG-4 coded bit streams for error resilience, Proc. Conf. on Information Sciences and Systems, Baltimore, 1999.
8. J. Wen and J. Villasenor, Utilizing soft information in decoding of variable length codes, Proc. IEEE Data Compression Conference, Utah, 1999.
9. S. Kaiser and M. Bystrom, Soft decoding of variable-length codes, Proc. IEEE International Communications Conference, New Orleans, 2000.
10. I. Witten, R. Neal and J. Cleary, Arithmetic coding for data compression, Communications of the ACM, 30(6), June 1987.
11. J. Mitchell and W. Pennebaker, Optimal hardware and software arithmetic coding procedures for the Q-coder, IBM Journal of Research and Development, 32(6), November 1988.
9.2 PRE-FILTERING
DCT-based compression algorithms can perform well for smooth, noise-free regions
of images. A region with flat texture or a gradual variation in texture (like the face area
of the image in Figure 9.3) produces a very small number of significant DCT coefficients and
hence is compressed efficiently. However, to generate a clean video image like Figure 9.3
requires good lighting, an expensive camera and a high-quality video capture system. For
most applications, these requirements are impractical. A typical desktop video-conferencing scenario might involve a low-cost camera on top of the user's monitor, poor lighting and a busy background, and all of these factors can be detrimental to the quality of the final image. A typical source image for this type of application is shown in Figure 9.1. Further difficulties can be caused for motion video compression: for example, a hand-held camera or a motorised surveillance camera are susceptible to camera shake which can significantly reduce the efficiency of motion estimation and compensation.
9.2.2 Camera Movement

Unwanted camera movements (camera shake or jitter) are another cause of poor compression efficiency. Block-based motion estimation performs best when the camera is fixed in one position or when it undergoes smooth linear movement (pan or tilt). In the case of a hand-held camera, or a motorised pan/tilt operation (e.g. as a surveillance camera sweeps over a scene), the image tends to experience random jitter between successive frames. If the motion search algorithm does not detect this jitter correctly, the result is a large residual frame after motion compensation. This in turn leads to a larger number of bits in the coded bit stream and hence poorer compression efficiency.

Example

Two versions of a short 10-frame video sequence (the first frame is shown in Figure 9.6) are encoded with MPEG-4 (simple profile, with half-pixel motion estimation and
compensation). Version 1 (the original sequence) has a fixed camera position. Version 2 is identical except that 2 of the 10 frames are shifted horizontally or vertically by up to 2 pixels (to simulate camera shake). The sequences are coded with H.263, using a fixed quantiser step size (10) in each case. For Version 1 (the original), the encoded sequence is 18 703 bits. For Version 2 (with shaking of two frames), the encoded sequence is 29 080 bits: the compression efficiency drops by over 50% due to a small displacement within 2 of the 10 frames. This example shows that camera shake can be very detrimental to video compression performance (despite the fact that the encoder attempts to compensate for the motion).
The compression efficiency may be increased (and the subjective appearance of the video sequence improved) with automatic camera stabilisation. Mechanical stabilisation is used in some hand-held cameras, but this adds weight and bulk to the system. Electronic image stabilisation can be achieved without extra hardware (at the expense of extra processing). For example, one method1 attempts to stabilise the video frames prior to encoding. In this approach, a matching algorithm is used to detect global motion (i.e. common motion of all background areas, usually due to camera movement). The matching algorithm examines areas near the boundary of each image (not the centre of the image, since the centre usually contains foreground objects). If global motion is detected, the image is shifted to compensate for small, short-term movements due to camera shake.
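A simplified sketch of this kind of global motion detection is given below (C, illustrative only: the border-strip width, the +/-2 pixel search window and the function names are assumptions rather than details of the published method). It matches a border strip of the current frame against the previous frame and returns the displacement with the smallest sum of absolute differences (SAD); the frame can then be shifted by the negative of this displacement before encoding.

#include <limits.h>
#include <stdlib.h>

typedef struct { int dx, dy; } Shift;

/* SAD between border pixels of the current frame and displaced previous frame. */
static long strip_sad(const unsigned char *cur, const unsigned char *prev,
                      int width, int height, int strip, int dx, int dy)
{
    long sad = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* Only border pixels take part in the match: the centre is likely
             * to contain independently moving foreground objects. */
            if (x >= strip && x < width - strip &&
                y >= strip && y < height - strip)
                continue;
            int xs = x + dx, ys = y + dy;
            if (xs < 0 || xs >= width || ys < 0 || ys >= height)
                continue;
            sad += labs((long)cur[y * width + x] - prev[ys * width + xs]);
        }
    }
    return sad;
}

/* Search a small window of displacements and return the best global shift. */
Shift detect_global_motion(const unsigned char *cur, const unsigned char *prev,
                           int width, int height)
{
    Shift best = { 0, 0 };
    long best_sad = LONG_MAX;
    for (int dy = -2; dy <= 2; dy++)
        for (int dx = -2; dx <= 2; dx++) {
            long sad = strip_sad(cur, prev, width, height, 16, dx, dy);
            if (sad < best_sad) { best_sad = sad; best = (Shift){ dx, dy }; }
        }
    return best;   /* shift the frame by (-dx, -dy) to compensate for camera shake */
}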
9.3 POST-FILTERING
9.3.1 Image Distortion

Blocking
Often, the most obvious distortion or artefact is the appearance of regular square blocks superimposed on the image. These 'blocking' artefacts are a characteristic of block-based transform CODECs, and their edges are aligned with the 8 x 8 regions processed via the DCT. There are two causes of blocking artefacts: over-quantisation of the DC coefficient and suppression or over-quantisation of low-frequency AC coefficients. The DC coefficient corresponds to the average (mean) value of each 8 x 8 block. In areas of smooth shading (such as the face area in Figure 9.7), over-quantisation of the DC coefficient means that there is a large change in level between neighbouring blocks. When two blocks with similar shades are quantised to different levels, the reconstructed blocks can have a large jump in level and hence a visible change of shade. This is most obvious at the block boundary, appearing as a 'tiling' effect on smooth areas of the image. A second cause of blocking is over-quantisation or elimination of significant AC coefficients. Where there should be a smooth transition between blocks, a coarse reconstruction of low-frequency basis patterns (see Chapter 7) leads to discontinuities between block edges. Figure 9.9 illustrates these two blocking effects in one dimension. Image sample amplitudes for a flat region are shown on the left and for a smoothly varying region on the right.
Ringing

High quantisation can have a low-pass filtering effect, since higher-frequency AC coefficients tend to be removed during quantisation. Where there are strong edges in the original image, this low-pass effect can cause 'ringing' or ripples near the edges. This is analogous to the effect of applying a low-pass filter to a signal with a sharp change in amplitude: low-frequency ringing components appear near the change position. This effect appears in Figure 9.8 as ripples near the edge of the hat.
contributes to the blocking effect (in this case, there is a sharp change between the strong
basis pattern and the neighbouring blocks).
These three distortion effects degrade the appearance of decoded images or video frames. Blocking is particularly obvious because the large 8 x 8 patterns are clearly visible in highly compressed frames. The artefacts can also affect the performance of motion-compensated video coding. A video encoder that uses motion-compensated prediction forms a reconstructed (decoded) version of the current frame as a prediction reference for further encoded frames: this ensures that the encoder and decoder use identical reference frames and prevents 'drift' at the decoder. However, if a high quantiser scale is used, the reference frame at the encoder will contain distortion artefacts that were not present in the original frame.
Figure 9.9 Blocking effects in one dimension due to (a) DC coefficient quantisation and (b) AC coefficient quantisation: original and reconstructed sample amplitudes for neighbouring blocks A and B
When the reference frame (containing distortion) is subtracted from the next input frame (without distortion), these artefacts will tend to increase the energy in the motion-compensated residual frame, leading to a reduction in compression efficiency. This effect can produce a significant residual component even when there is no change between successive frames. Figure 9.10 illustrates this effect. The distorted reference frame (a) is subtracted from the current frame (b). There is no change in the image content but the difference frame (c) clearly contains residual energy (the speckled effect). This residual energy will be encoded and transmitted, even though there is no real change in the image.
It is possible to design post-filters to reduce the effect of these predictable artefacts. The goal is to reduce the strength of a particular type of artefact without adversely affecting the important features of the image (such as edges). Filters can be classified according to the type of artefact they are addressing (usually blocking or ringing), their computational complexity and whether they are applied inside or outside the coding loop. A filter applied after decoding (outside the loop) can be made independent of the CODEC; however, good performance can be achieved by making use of parameters from the video decoder. A filter applied to the reconstructed frame within the encoder (inside the loop) has the advantage of improving compression efficiency (as described above) but must also be applied within the decoder. The use of in-loop filters is limited to non-standard CODECs except in the case of loop filters defined in the coding standards. Post-filters can be categorised as follows, depending on their position in the coding chain.
(a) In-loop filters

The filter is applied to the reconstructed frame both in the encoder and in the decoder. Applying the filter within the encoder loop can improve the quality of the reconstructed reference frame and hence the efficiency of motion-compensated prediction.
204
P M -AND POST-PROCESSING
Subtract
Current
frame
Image encoder
Encoded
frame
Motion
estimation
vectors
In-loop
filter
4
Previous
frame@)
]e
image
Add
Encoded
frame
Image decoder
decoder
Decoded
frame
l/-----
filter
frame(s)
(c) Decoder-independent filters

In order to minimise dependence on the decoder, the filter may be applied after decoding without any knowledge of decoder parameters, as illustrated in Figure 9.13. This approach gives the maximum flexibility (for example, the decoder and the filter may be treated as separate 'black boxes' by the system designer). However, filter performance is generally not
as good as decoder-dependent filters, since the filter has no information about the coding of each block.
9.3.2 De-blocking Filters

Blocking artefacts are usually the most obvious and therefore the most important to minimise through filtering.

In-loop filters

It is possible to implement a non-standard in-loop de-blocking filter; however, the use of such a filter is limited to proprietary systems. Annex J of the H.263+ standard defines an optional de-blocking filter that operates within the encoder and decoder loops. A 1-D filter is applied across block boundaries as shown in Figure 9.14. Four pixel positions at a time are smoothed across the block boundaries, first across the horizontal boundaries and then across the vertical boundaries. The strength of the filter (i.e. the amount of smoothing applied to the pixels) is chosen depending on the quantiser value (as described above). The filter is effectively disabled if there is a strong discontinuity between the values of A and B or between the values of C and D: this helps to prevent filtering of genuine strong horizontal or vertical edges in the original picture. In-loop de-blocking filters have been compared2 and the authors conclude that the best performance is given by POCS algorithms (described briefly below).
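The following C sketch illustrates the general idea of a boundary-smoothing filter of this kind. It is not the Annex J algorithm itself: the filter strength, thresholds and update equations are simplified assumptions, but it shows the two key elements described above, i.e. smoothing across the boundary and disabling the filter where a genuine image edge is detected.

#include <stdlib.h>

/* Simplified 1-D de-blocking across a vertical block boundary in one row of pixels.
 * A and B are the two pixels to the left of the boundary, C and D the two to the
 * right (following the A, B, C, D labelling used for Figure 9.14).
 * 'boundary' must be at least 2 and at most (row length - 2). */
void deblock_boundary(unsigned char *row, int boundary, int quant)
{
    int A = row[boundary - 2], B = row[boundary - 1];
    int C = row[boundary],     D = row[boundary + 1];
    int strength = 2 * quant;              /* stronger smoothing at coarser quantisers */

    /* Disable the filter across genuine image edges: a large step between
     * A and B or between C and D suggests real detail, not a block artefact. */
    if (abs(A - B) > strength || abs(C - D) > strength)
        return;

    int step = C - B;                      /* the discontinuity at the boundary */
    if (abs(step) > strength)
        return;                            /* treat a very large step as a real edge too */

    row[boundary - 1] = (unsigned char)(B + step / 4);   /* pull B and C towards */
    row[boundary]     = (unsigned char)(C - step / 4);   /* each other slightly  */
}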
9.3.3 De-ringing Filters

After blocking, ringing is often the next most obvious type of coding artefact. De-ringing filters receive somewhat less attention than de-blocking filters. MPEG-4 Annex F describes an optional post-decoder de-ringing filter. In this algorithm, a threshold thr is set for each reconstructed block based on the mean pixel value in the block. The pixel values within the block are compared with the threshold and 3 x 3 regions of pixels that are all either above or below the threshold are filtered using a 2-D spatial filter. This has the effect of smoothing homogeneous regions of pixels on either side of strong image edges whilst preserving the edges themselves: it is these regions that are likely to be affected by ringing. Figure 9.16
shows an example of regions of pixels that may be filtered in this way in a block containing a strong edge. In this example, pixels adjacent to the edge will be ignored by the filter (hence preserving the edge detail). Pixels in relatively flat regions on either side of the edge (which are likely to contain ringing) will be filtered.
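A simplified sketch of this thresholding approach is shown below (C, illustrative only and not the MPEG-4 Annex F filter itself; the 3 x 3 averaging and the exact test are assumptions). Each pixel whose 3 x 3 neighbourhood lies entirely above or entirely below the block-mean threshold is replaced by the neighbourhood average; pixels next to a strong edge fail this test and are left unfiltered.

/* Simplified de-ringing filter for one 8 x 8 reconstructed block. */
void dering_block(unsigned char blk[8][8])
{
    int sum = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += blk[y][x];
    int thr = sum / 64;                       /* threshold: mean value of the block */

    unsigned char out[8][8];
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            out[y][x] = blk[y][x];

    for (int y = 1; y < 7; y++) {
        for (int x = 1; x < 7; x++) {
            int above = 0, below = 0, acc = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int p = blk[y + dy][x + dx];
                    acc += p;
                    if (p >= thr) above++; else below++;
                }
            if (above == 9 || below == 9)     /* homogeneous region: safe to smooth */
                out[y][x] = (unsigned char)(acc / 9);
        }
    }
    for (int y = 0; y < 8; y++)               /* write the filtered block back */
        for (int x = 0; x < 8; x++)
            blk[y][x] = out[y][x];
}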
9.3.4 Error Concealment Filters

A final category of decoder filter is that of error concealment filters. When a decoder detects that a transmission error has occurred, it is possible to estimate the area of the frame that is likely to be corrupted by the error. Once the area is known, a spatial or temporal filter may be applied to attempt to conceal the error. Basic error concealment filters operate by interpolating from neighbouring error-free regions (spatially and/or temporally) to cover the damaged area. More advanced methods (such as POCS filtering, mentioned above) attempt to maintain image features across the damaged region. Error concealment is discussed further in Chapter 11.
9.4 SUMMARY

Pre- and post-filtering can be valuable tools for a video CODEC designer. The goal of a pre-filter is to clean up the source image and compensate for imperfections such as camera noise and camera shake whilst retaining visually important image features. A well-designed pre-filter can significantly improve compression efficiency by reducing the number of bits spent on coding noise. Post-filters are designed to compensate for characteristic artefacts introduced by block-based transform coding such as blocking and ringing effects. A post-filter can greatly improve subjective visual quality, reducing obvious distortions whilst retaining important features in the image. There are three main classes of this type of filter: loop filters (designed to improve motion compensation performance as well as image quality
and present in both encoder and decoder), decoder-dependent post-filters (which make use of decoded parameters to improve filtering performance) and decoder-independent post-filters (which are independent of the coding algorithm but generally suffer from poorer performance than the other types). As with many other aspects of video CODEC design, there is usually a trade-off between filter complexity and performance (in terms of bit rate and image quality). The relationship between computational complexity, coded bit rate and image quality is discussed in the next chapter.
REFERENCES
1. R. Kutka, Detection of image background movement as compensation for camera shaking with mobile platforms, Proc. Picture Coding Symposium PCS01, Seoul, April 2001.
2. M. Yuen and H. R. Wu, Performance comparison of loop filtering in generic MC/DPCM/DCT video coding, Proc. SPIE Digital Video Compression, San Jose, 1996.
3. Y. Yang, N. Galatsanos and A. Katsaggelos, Projection-based spatially adaptive reconstruction of block transform compressed images, IEEE Trans. Image Processing, 4, July 1995.
4. Y. Yang and N. Galatsanos, Removal of compression artifacts using projections onto convex sets and line modeling, IEEE Trans. Image Processing, 6, October 1997.
5. B. Jeon, J. Jeong and J. M. Jo, Blocking artifacts reduction in image coding based on minimum block boundary discontinuity, Proc. SPIE VCIP95, Taipei, 1995.
6. A. Nostratina, Embedded post-processing for enhancement of compressed images, Proc. Data Compression Conference DCC-99, Utah, 1999.
7. J. Chou, M. Crouse and K. Ramchandran, A simple algorithm for removing blocking artifacts in block transform coded images, IEEE Signal Processing Letters, 5, February 1998.
8. S. Hong, Y. Chan and W. Siu, A practical real-time post-processing technique for block effect elimination, Proc. IEEE ICIP96, Lausanne, September 1996.
9. S. Marsi, R. Castagno and G. Ramponi, A simple algorithm for the reduction of blocking artifacts in images and its implementation, IEEE Trans. on Consumer Electronics, 44(3), August 1998.
10. T. Meier, K. Ngan and G. Crebbin, Reduction of coding artifacts at low bit rates, Proc. SPIE Visual Communications and Image Processing, San Jose, January 1998.
11. Z. Xiong, M. Orchard and Y.-Q. Zhang, A deblocking algorithm for JPEG compressed images using overcomplete wavelet representations, IEEE Trans. CSVT, 7, April 1997.
10
Rate, Distortion and Complexity

10.1 INTRODUCTION
The choice of video coding algorithm and encoding parameters affect the coded bit rate and
the quality of the decoded video sequence (as well as the computational complexity of the
video CODEC). The precise relationship between coding parameters, bit rate and visual
quality varies depending on the characteristics of the video sequence (e.g. noisy input vs.
clean input; high detail vs. low detail; complex motion vs. simple motion). At the same
time, practical limits determined by the processor and the transmission environment put
constraints on the bit rate and image quality that may be achieved. It is important to control
the video encoding process in order to maximise compression performance (i.e. high compression and/or good image quality) whilst remaining within the practical constraints of transmission and processing.
Rate-distortion optimisation attempts to maximise image quality subject to transmission
bit rate constraints. The best optimisation performance comes at the expense of impractically high computation. Practical algorithms for the control of bit rate can be judged according to how closely they approach optimum performance. Many alternative rate control algorithms exist; sophisticated algorithms can achieve excellent rate-distortion performance, usually at a cost of increased computational complexity. The careful selection and implementation of a rate control algorithm can make a big difference to video CODEC performance.
Recent trends in software-only CODECs and video coding in power-limited environments (e.g. mobile computing) mean that computational complexity is an important factor in video CODEC performance. In many application scenarios, video quality is constrained by available computational resources as well as (or instead of) available bit rate. Recent developments in variable-complexity algorithms (VCAs) for video coding enable the developer to manage computational complexity and trade processing resources for image quality. This leads to situations in which rate, complexity and distortion are interdependent. New algorithms are required to jointly control bit rate and computational complexity whilst minimising distortion.
In this chapter we examine the factors that influence rate-distortion performance in a
video CODEC and discuss how these factors can be exploited to efficiently control coded bit
rate. We describe a number of popular algorithms for rate control. We discuss the relationship between computation, rate and distortion and show how new VCAs are beginning to
influence the design of video CODECs.
Some examples of bit rate profiles are given below. Figure 10.1 plots the number of bits in each frame for a video sequence encoded using Motion JPEG. Each frame is coded independently (intra-coded) and the bit rate for each frame does not change significantly. Small variations in bit rate are due to changes in the spatial content of the frames in the 10-frame sequence. Figure 10.2 shows the bit rate variation for the same sequence coded with H.263.
Figure 10.1 Bit rate profile: Motion JPEG (bits per frame)
Figure 10.2 Bit rate profile: H.263 (bits per frame)
These examples show that the choice of algorithm and the content of the video sequence affect the bit rate (and also the visual quality) of the coded sequence. At the same time, the operating environment places important constraints on bit rate. These may include the mean bit rate, the maximum (peak) bit rate, the allowable bit rate variation and latency.
Examples:

DVD-video: The mean bit rate is determined by the duration of the video material. For example, if a 3-hour movie is to be stored on a single 4.7 Gbyte DVD, then the mean bit rate (for the whole movie) must not exceed around 3.5 Mbps. The maximum bit rate is determined by the maximum transfer rate from the DVD and the throughput of the video decoder. Bit-rate variation (subject to these constraints) and latency are not such important issues.
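The mean rate quoted above follows directly from the disc capacity and the playing time:

$$\text{mean bit rate} \approx \frac{4.7\times10^{9}\ \text{bytes}\times 8\ \text{bits/byte}}{3\times 3600\ \text{s}} \approx 3.48\times10^{6}\ \text{bits/s} \approx 3.5\ \text{Mbps}$$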
Video conferencing over ISDN: The ISDN channel operates at a constant bit rate (e.g. 128 kbps). The encoded bit rate must match this channel rate exactly, i.e. no variation is allowed. The output of the video encoder is constant bit rate (CBR) coded video.
Video conferencing over a packet-switched network: The situation here is more complicated. The available mean and maximum bit rates may vary, depending on the network routeing and on the volume of other traffic. In some situations, latency and bit rate may be linked, i.e. a higher data rate may cause increased congestion and delay in the network. The video encoder can generate CBR or variable bit rate (VBR) coded video, but the mean and peak data rate may depend on the capacity of the network connection.
Each of these application examples has different requirements in terms of the rate of encoded video data. Rate control, the process of matching the encoder output to rate constraints, is a necessary component of the majority of practical video coding applications. The rate control problem is defined below in Section 10.2.3. There are many different approaches to solving this problem and in a given situation, the choice of rate control method can significantly influence video quality at the decoder. Poor rate control may cause a number of problems such as low visual quality, fluctuations in visual quality and dropped frames leading to 'jerky' video.

In the next section we will examine the relationship between coding parameters, bit rate and visual quality.
10.2.2 Rate-Distortion Performance
A lossless compression encoder produces a reduction in data rate with no loss of fidelity of the original data. A lossy encoder, on the other hand, reduces data rate at the expense of a loss of quality. As discussed previously, significantly higher compression of image and video data can be achieved using lossy methods than with lossless methods. The output of a lossy video CODEC is a sequence of images that are of a lower quality than the original images.
The rate-distortion performance of a video CODEC provides a measure of the image quality produced at a range of coded bit rates. For a given compressed bit rate, measure the distortion of the decoded sequence (relative to the original sequence). Repeat this for a range of compressed bit rates to obtain a rate-distortion curve such as the example shown in Figure 10.4. Each point on this graph is generated by encoding a video sequence using an MPEG-4 encoder with a different quantiser step size Q. Smaller values of Q produce a higher encoded bit rate and lower distortion; larger values of Q produce lower bit rates at the expense of higher distortion. In this figure, image distortion is measured by peak signal to noise ratio (PSNR), described in Chapter 2. PSNR is a logarithmic measure, and a high value of PSNR indicates low distortion. The video sequence is a relatively static, head-and-shoulders sequence (Claire). The shape of the rate-distortion curve is very typical: better image quality (as measured by PSNR) occurs at higher bit rates, and the quality drops sharply once the bit rate is below a certain threshold.
The rate-distortion performance of a video CODEC may be affected by many factors,
including the following.
Video material
Under identical encoding conditions, the rate-distortion performance may vary considerably depending on the video material that is encoded.

Figure 10.4  Rate-distortion curve example (PSNR against rate in kbps)

Figure 10.5  Rate-distortion curves for the Claire and Foreman sequences (PSNR against rate in kbps)

Figure 10.5 compares the rate-distortion
performance of two sequences, Claire and Foreman, under identical encoding conditions (MPEG-4, fixed quantiser step size, varied from 4 to 24). The Foreman sequence contains a lot of movement and detail and is therefore more difficult to compress than Claire. At the same value of quantiser, Foreman tends to have a much higher encoded bit rate and a higher distortion (lower PSNR) than Claire. The shape of the rate-distortion curve is similar but the rate and distortion values are very different.
Encoding parameters
In a DCT-based CODEC, a number of encoding parameters (in addition to quantiser step size) affect the encoded bit rate. An efficient motion estimation algorithm produces a small residual frame after motion compensation and hence a low coded bit rate; intra-coded macroblocks usually require more bits than inter-coded macroblocks; sub-pixel motion compensation produces a lower bit rate than integer-pixel compensation; and so on. Less obvious effects include, for example, the intervals at which the quantiser step size is varied during encoding. Each time the quantiser step size changes, the new value (or the change) must be signalled to the decoder and this takes more bits (and hence increases the coded bit rate).
Encoding algorithms
Figures 10.1-10.3 illustrate how the coded bit rate changes depending on the compression algorithm. In each of these figures, the decoded image quality is roughly the same but there is a big difference in compressed bit rate.
So far we have discussed only spatial distortion (the variation in quality of individual frames in the decoded video sequence). It is also important to consider temporal distortion, i.e. the situation where complete frames are dropped from the original sequence in order to achieve acceptable performance. The curves shown in Figure 10.5 were generated for video sequences encoded at 30 frames per second. It would be possible to obtain lower spatial distortion by reducing the frame rate to 15 frames per second (dropping every second frame), at the expense of an increase in temporal distortion (because the frame rate has been reduced). The effect of this type of temporal distortion is apparent as jerky video. This is usually just noticeable around 15-20 frames per second and very noticeable below 10 frames per second.
The rate-distortion optimisation problem may be stated as follows: minimise the distortion D subject to a constraint on the coded bit rate R,

min{D} subject to R ≤ R_target    (10.1)

The rate-distortion behaviour of a particular CODEC may be mapped out experimentally as follows:
1. Encode a video sequence with a particular set of encoding parameters (quantiser step size, macroblock mode selection, etc.) and measure the coded bit rate and decoded image quality (or distortion). This gives a particular combination of rate (R) and distortion (D), an R-D operating point.
2. Repeat the encoding process with a different set of encoding parameters to obtain another R-D operating point.
3. Repeat for further combinations of encoding parameters. (Note that the set of possible combinations of parameters is very large.)
Figure 10.6 shows a typical set of operating points plotted on a graph. Each point represents the mean bit rate and distortion achieved for a particular set of encoding parameters. (Note that distortion [D] increases as rate [R] decreases.) Figure 10.6 indicates that there are bad and good rate-distortion points. In this example, the operating points that give the best rate-distortion performance (i.e. the lowest distortion for a given rate R) lie close to the dotted curve. Rate-distortion theory tells us that this curve is convex (a convex hull).

Figure 10.6  R-D operating points (distortion D against rate R)

For a given
target rate R_target, the minimum distortion D occurs at a point on this convex curve. The aim of rate-distortion optimisation is to find a set of coding parameters that achieves an operating point as close as possible to this optimum curve.
One way to find the position of the hull and hence achieve this optimal performance is by using Lagrangian optimisation. Equation 10.1 is difficult to minimise directly and a popular method is to express it in a slightly different way as follows:

min{J = D + λR}    (10.2)

J is a new function that contains D and R (as before) as well as a Lagrange multiplier, λ. J is the equation of a straight line D + λR, where λ gives the slope of the line. There is a solution to Equation 10.2 for every possible multiplier λ, and each solution is a straight line that makes a tangent to the convex hull described earlier. The procedure may be summarised as follows:
1. Encode the sequence many times, each time with a different set of coding parameters.
2. Measure the coded bit rate (R) and distortion (D) of each coded sequence. These measurements are the operating points (R, D).
3. For each value of λ, find the operating point (R, D) that gives the smallest value J, where J = D + λR. This gives one point on the convex hull.
4. Repeat step (3) for a range of λ to find the shape of the convex hull.

This procedure is illustrated in Figure 10.7. The (R, D) operating points are plotted as before. Three values of λ are shown: λ1, λ2 and λ3. In each case, the solution to J = D + λR is a straight line with slope determined by λ. The operating point (R, D) that gives the smallest J is shown in black, and these points occur on the lower boundary (the convex hull) of all the operating points.
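The convex-hull search described above is straightforward to express in code. The sketch below sweeps a small set of λ values over a list of (R, D) operating points and reports the hull point selected by each λ; the operating points and λ values are illustrative, not measured data.

/* Sketch: selecting convex-hull operating points by Lagrangian cost J = D + lambda*R. */
#include <stdio.h>

typedef struct { double rate;        /* coded bit rate (kbps)        */
                 double distortion;  /* e.g. MSE (lower is better)   */ } OpPoint;

int main(void)
{
    OpPoint points[]  = { {100, 40}, {200, 25}, {300, 18}, {400, 15}, {500, 14} };
    double  lambdas[] = { 0.01, 0.05, 0.15 };
    int n_points  = sizeof(points) / sizeof(points[0]);
    int n_lambdas = sizeof(lambdas) / sizeof(lambdas[0]);

    for (int i = 0; i < n_lambdas; i++) {
        int best = 0;
        double best_j = points[0].distortion + lambdas[i] * points[0].rate;
        for (int j = 1; j < n_points; j++) {
            double cost = points[j].distortion + lambdas[i] * points[j].rate;
            if (cost < best_j) { best_j = cost; best = j; }
        }
        printf("lambda = %.2f: hull point (R = %.0f kbps, D = %.1f)\n",
               lambdas[i], points[best].rate, points[best].distortion);
    }
    return 0;
}

Small λ values favour low-distortion (high-rate) points and large λ values favour low-rate points, tracing out the lower boundary of the operating points.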
The Lagrangian method will find the set (or sets) of encoding parameters that give the best performance and these parameters may then be applied to the video encoder to achieve optimum rate-distortion performance. However, this is usually a prohibitively complex procedure, as the following example illustrates.
Example
Macroblock 0 in a picture is encoded using MPEG-4 (simple profile) with a quantiser step size Q0 in the range 2-31. The choice of Q1 for macroblock 1 is constrained to Q0 +/- 2. There are 30 possible values of Q0; (almost) 30 x 5 = 150 possible combinations of Q0 and Q1; (almost) 30 x 5 x 5 = 750 combinations of Q0, Q1 and Q2; and so on.
The computation required to evaluate all possible choices of encoding decision becomes prohibitive even for a short video sequence. Furthermore, no two video sequences produce the same rate-distortion performance for the same encoding parameters and so this process needs to be carried out each time a sequence is to be encoded.
There have been a number of attempts to simplify the Lagrangian optimisation method in order to make it more practically useful.2-4 For example, certain assumptions may be made about good and bad choices of encoding parameters in order to limit the exponential growth of complexity described above. The computational complexity of some of these methods is still much higher than the computation required for the encoding process itself; however, this complexity may be justified in some applications, such as (for example) encoding a feature film to obtain optimum rate-distortion performance for storage on a DVD.
An alternative approach is to estimate the optimum operating points using a model of the rate-distortion characteristics. Lagrange-based optimisation is first carried out on some representative video sequences in order to find the 'true' optimal parameters for these sequences. The authors propose a simple model of the relationship between encoding mode selection and λ, and the encoding mode decisions required to achieve minimal distortion for a given rate constraint can be estimated from this model. The authors report a clear performance gain over previous methods with minimal computational complexity. Another
attempt has been made6 to define an optimum partition between the coded bits representing motion vector information and the coded bits representing displaced frame difference (DFD) in an inter-frame CODEC.
10.2.4 Practical Rate Control Methods
Bit-rate control in a real-time video CODEC requires a relatively low-complexity algorithm. The choice of rate control algorithm can have a significant effect on video quality and many alternative algorithms have been developed. The choice of rate control algorithm is not straightforward because a number of factors are involved.
Figure 10.8  Rate control with an output buffer: video frames are encoded under quantiser (Q) control, coded bits enter the output buffer at the encoder rate and leave at the channel rate

Figure 10.9  Output buffer contents B during encoding of frames 1-6: unconstrained (black) and with feedback control (grey)

Figure 10.9 shows the buffer contents as a sequence of frames is encoded. Because the encoder generates coded bits at a variable rate whilst the buffer empties into the channel at a constant rate, it is possible for the buffer contents to rise to a point at which the buffer overflows (Bmax in the figure). The black line shows the unconstrained case: the buffer overflows in frames 5 and 6. To avoid this happening, a feedback constraint is required, where the buffer occupancy B is fed back to control the quantiser step size Q. As B increases, Q also increases, which has the effect of increasing compression and reducing the number of bits per frame bi. The grey line in Figure 10.9 shows that with feedback, the buffer contents are never allowed to rise above about 50% of Bmax.
This method is simple and straightforward but has several disadvantages. A sudden increase in activity in the video scene may cause B to increase too rapidly to be effectively controlled by the quantiser Q, so that the buffer overflows, and in this case the only course of action is to skip frames, resulting in a variable frame rate. As Figure 10.9 shows, B increases towards the end of each encoded frame and this means that Q also tends to increase towards the end of the frame. This can lead to an effect whereby the top of each frame is encoded with a relatively high quality whereas the foot of the frame is highly quantised and has an obvious drop in quality, as shown in Figure 10.10. The basic buffer-feedback method tends to produce decoded video with obvious quality variations.
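A minimal sketch of this buffer-feedback mechanism is shown below. The linear mapping from buffer occupancy to quantiser step size, and the crude frame-size model, are illustrative assumptions rather than values from any standard.

/* Sketch: basic buffer-feedback rate control (Q rises as the buffer fills). */
#include <stdio.h>

#define B_MAX        100000   /* buffer capacity, bits (illustrative)   */
#define CHANNEL_BITS   8000   /* bits removed from the buffer per frame */

/* Map buffer occupancy to a quantiser step size in the range 2..31 */
static int quantiser_from_occupancy(int buffer_bits)
{
    int q = 2 + (29 * buffer_bits) / B_MAX;
    return (q > 31) ? 31 : q;
}

/* Crude model: larger Q -> fewer coded bits per frame (not a real encoder) */
static int encode_frame(int q)
{
    return 60000 / q;
}

int main(void)
{
    int buffer = 0;
    for (int frame = 1; frame <= 6; frame++) {
        int q    = quantiser_from_occupancy(buffer);
        int bits = encode_frame(q);
        buffer += bits;                 /* coded frame enters the buffer  */
        buffer -= CHANNEL_BITS;         /* channel empties the buffer     */
        if (buffer < 0) buffer = 0;
        printf("frame %d: Q=%2d, %5d bits, buffer=%6d\n", frame, q, bits, buffer);
    }
    return 0;
}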
Figure 10.10  Variation of quality within a frame: low quantiser at the top, medium quantiser in the middle, high quantiser at the foot
The rate control algorithm described in the MPEG-2 Test Model7 operates in three steps: bit allocation, rate control and modulation.

1. Bit allocation:
(a) assign a target number of bits to the current GOP (based on the target constant bit rate);
(b) assign a target number of bits T to the current picture based on:
- the complexity of the previous picture of the same type (I, P, B) (i.e. the level of temporal and/or spatial activity);
- the target number of bits for the GOP.
2. Rate control:
(a) during encoding of the current picture, maintain a count of the number of coded bits so far, d;
(b) compare d with the target total number of bits T and choose the quantiser step size Q to try and meet the target T.
3. Modulation:
(a) measure the variance of the luminance data in the current macroblock;
(b) if the variance is higher than average (i.e. there is a high level of detail in the current region of the picture), increase Q (and hence increase compression).
The aim of this rate control algorithm is to meet the overall target bit rate, to match the coded size of each picture to its target T and to quantise areas of high spatial detail more coarsely than smooth regions.
This last aim should give improved subjective visual quality since the human eye is more
sensitive to coarse quantisation (high distortion)
in areas of low detail (such as a smooth
region of the picture).
A second practical example is the rate control scheme of H.263 Test Model 8 (TM8),8 which operates at the frame level and at the macroblock level.

Frame level control  Each encoded frame adds to the encoder output buffer contents; each transmitted frame removes bits from the output buffer. If the number of bits in the buffer exceeds a threshold M, skip the next frame; otherwise set a target number of bits B for encoding the next frame. A higher threshold M means fewer skipped frames, but a larger delay through the system.
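The sketch below illustrates this frame-level decision (threshold M, target B). The way the target is derived from the buffer occupancy here is an illustrative assumption, not the TM8 formula.

/* Sketch: frame-level rate control decision.
   Returns 0 if the next frame should be skipped, otherwise a target
   number of bits for the next frame. */
#include <stdio.h>

#define M_THRESHOLD    20000   /* buffer threshold, bits (illustrative)          */
#define BITS_PER_FRAME  2133   /* channel bits per frame interval, e.g. 64000/30 */

static int skip_or_target(int buffer_bits)
{
    if (buffer_bits > M_THRESHOLD)
        return 0;                               /* skip the next frame */
    /* spend the per-frame channel budget minus a share of the backlog */
    int target = BITS_PER_FRAME - buffer_bits / 10;
    return (target > 0) ? target : 1;
}

int main(void)
{
    int occupancy[] = { 0, 5000, 15000, 25000 };
    for (int i = 0; i < 4; i++) {
        int t = skip_or_target(occupancy[i]);
        if (t == 0)
            printf("buffer=%5d bits: skip next frame\n", occupancy[i]);
        else
            printf("buffer=%5d bits: target B = %d bits\n", occupancy[i], t);
    }
    return 0;
}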
Macroblock level control  This is based on a model for the number of bits Bi required to encode macroblock i (Equation 10.3), in terms of the standard deviation σi of the (motion-compensated) macroblock data, the quantiser step size Qi and model parameters K and C:

Bi = A (K σi² / Qi² + C)    (10.3)

where A is the number of pixels in a macroblock. For each macroblock i:

1. Measure the standard deviation σi of the macroblock data.
2. Choose the quantiser step size Qi using the model of Equation 10.3, the number of bits remaining for the current frame and the macroblock weight αi (described below).
3. Encode the macroblock.
4. Update the model parameters K and C based on the actual number of coded bits produced for the macroblock.
The weight αi controls the importance of macroblock i to the subjective appearance of the image: a low value of αi means that the current macroblock is likely to be highly quantised. In the test model, these weights are selected to minimise changes in Qi at lower bit rates because each change involves sending a modified quantisation parameter DQUANT, which means encoding an extra 5 bits per macroblock. It is important to minimise the number of changes to Qi during encoding of a frame at low bit rates because the extra 5 bits in a macroblock may become significant; at higher bit rates, this DQUANT overhead is less important and we may change Q more frequently without significant penalty.
This rate control method is effective at maintaining good visual quality with a small encoder output buffer, which keeps coding delay to a minimum (important for low-delay real-time communications).
Figure 10.11  Bit rate variation (bits per frame) for the Carphone sequence, H.263 with TM8 rate control at 64 kbps
Example
A 200-frame video sequence, Carphone, is encoded using H.263 with TM8 rate control. The original frame rate is 30 frames per second, QCIF resolution, and the target bit rate is 64 kbps. Figure 10.11 shows the bit-rate variation during encoding. In order to achieve 64 kbps without dropping any frames, the mean bit rate should be 2133 bits per frame, and the encoder clearly manages to maintain this bit rate (with occasional variations of about +/- 10%). Figure 10.12 shows the PSNR of each frame in the sequence after encoding and decoding. Towards the end of the sequence, the movement in the scene increases and it becomes harder to code efficiently. The rate control algorithm compensates for this by increasing the quantiser step size and the PSNR drops accordingly. Out of the original 200 frames, the encoder has to drop 6 frames to avoid buffer overflow.
MPEG-4 Annex L
The MPEG-4 video standard describes an optional rate control algorithm in Annex L.1, known as the Scalable Rate Control (SRC) scheme. This algorithm is appropriate for a single video object (i.e. a rectangular VO that covers the entire frame) and a range of bit rates and spatial/temporal resolutions. The scheme described in Annex L offers rate control at the frame level only (i.e. a single quantiser step size is chosen for a complete frame). The SRC attempts to achieve a target bit rate over a certain number of frames (a segment of frames, usually starting with an I-picture).
Figure 10.12  PSNR per frame: Carphone, H.263 TMN-8 rate control, 64 kbps
The SRC scheme assumes the following model for the encoder rate R:

R = (X1 · S)/Q + (X2 · S)/Q²    (10.4)

Q is the quantiser step size, S is the mean absolute difference of the residual frame after motion compensation and X1, X2 are model parameters. S provides a measure of frame complexity (easier to compute than the standard deviation σ used in the H.263 TM8 rate control scheme because the sum of absolute differences, SAE, is calculated during motion estimation).
Rate control consists of the following steps which are carried out after motion compensation and before encoding of each frame i:
1. Calculate a target bit rate Ri, based on the number of frames in the segment, the number of bits that are available for the remainder of the segment, the maximum acceptable buffer contents and the estimated complexity of frame i. (The maximum buffer size affects the latency from encoder input to decoder output. If the previous frame was complex, it is assumed that the next frame will be complex and should therefore be allocated a suitable number of bits: the algorithm attempts to balance this requirement against the limit on the total number of bits for the segment.)
2. Compute the quantiser step size Qi (to be applied to the whole frame): calculate S for the complete residual frame and solve Equation 10.4 to find Q.
3. Encode the frame.
4. Update the model parameters X1, X2 based on the actual number of bits generated for frame i.
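Step 2 amounts to solving the quadratic model of Equation 10.4 for Q. The sketch below illustrates this; the model parameters, frame complexity and target used here are made-up values, and this is not the normative MPEG-4 procedure.

/* Sketch: choosing a frame quantiser from the rate model of Equation 10.4,
   R = X1*S/Q + X2*S/Q^2.  Multiplying through by Q^2 gives
   R*Q^2 - X1*S*Q - X2*S = 0, solved for the positive root Q. */
#include <stdio.h>
#include <math.h>

static double solve_for_q(double r_target, double s, double x1, double x2)
{
    double a = r_target, b = -x1 * s, c = -x2 * s;
    double disc = b * b - 4.0 * a * c;        /* >= 0 for positive inputs */
    return (-b + sqrt(disc)) / (2.0 * a);
}

int main(void)
{
    double x1 = 2000.0, x2 = 16000.0;  /* model parameters (illustrative)            */
    double s  = 8.0;                   /* mean absolute difference of residual frame */
    double r  = 4000.0;                /* target bits for this frame (illustrative)  */

    printf("Chosen quantiser step size Q = %.2f\n", solve_for_q(r, s, x1, x2));
    return 0;
}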
The SRC algorithm differs from H.263 TM8 in two significant ways: it aims to achieve a target bit rate across a segment of frames (rather than a sequence of arbitrary length) and it does not modulate the quantiser step size within a coded frame (this can give a more uniform visual appearance within each frame but makes it difficult to maintain a small buffer size and hence a low delay). An extension to the SRC is described in Annex L.3 of MPEG-4 which supports modulation of the quantiser step size at the macroblock level and is therefore more suitable for low-delay applications. The macroblock rate control extension (L.3) is similar to H.263 Test Model 8 rate control.
The SRC algorithm is described in some detail in the MPEG-4 standard; a further discussion of MPEG-4 rate control issues can be found elsewhere.
10.3 COMPUTATIONAL COMPLEXITY

10.3.1 Computational Complexity and Video Quality
So far we have considered the trade-off between bit rate and video quality. The discussion of rate distortion in Section 10.2.3 highlighted another trade-off between computational complexity and video quality. A video coding algorithm that gives excellent rate-distortion performance (good visual quality for a given bit rate) may be impractical because it requires too much computation.
There are a number of cases where it is possible to achieve higher visual quality at the expense of increased computation. A few examples are listed below.

DCT block size: better decorrelation can be achieved with a larger DCT block size, at the expense of higher complexity. The 8 x 8 block size is popular because it achieves reasonable performance with manageable computational complexity.
Motion estimation search algorithm: full-search motion estimation (where every possible match is examined within the search area) can outperform most reduced-complexity algorithms. However, algorithms such as the three-step search, which sample only a few of the possible matches, are widely used because they reduce complexity at the expense of a certain loss of performance.

Motion estimation search area: a good match (and hence better rate-distortion performance) is more likely if the motion estimation search area is large. However, practical video encoders limit the search area to keep computation to manageable levels.
Rate-distortion optimisation: obtaining optimal (or even near-optimal) rate-distortion performance requires computationally expensive optimisation of encoding parameters, i.e. the best visual quality for a given bit rate is achieved at the expense of high complexity.

Choice of frame rate: encoding and decoding computation increases with frame rate and it may be necessary to accept a low frame rate (and jerky video) because of computational constraints.
These examples show that many aspects of video encoding and decoding are a trade-off between computation and quality. Traditionally, hardware video CODECs have been designed with a fixed level of computational performance. The architecture and the clock rate determine the maximum video processing rate. Motion search area, block size and maximum frame rate are fixed by the design and place a predetermined ceiling on the rate-distortion performance of the CODEC.
Recent trends in video CODEC design, however, require a more flexible approach to these trade-offs between complexity and quality. The following scenarios illustrate this.

Figure: processor utilisation (0-100%) in two scenarios, shared between a video CODEC, audio CODEC, decryption and the operating system
10.3.2 Variable Complexity Algorithms
A variable complexity algorithm (VCA) carries out a particular task with a controllable degree of computational overhead. As discussed above, computation is often related to image quality and/or compression efficiency: in general, better image quality and/or higher compression require a higher computational overhead.
Input-independent VCAs
In this class of algorithms, the computational complexity of the algorithm is independent of the input data. Examples of input-independent VCAs include:
Frame skipping: encoding a frame takes a certain amount of processing resources and skipping frames (i.e. not coding certain frames in the input sequence) is a crude but effective way of reducing processor utilisation. The relationship between frame rate and utilisation is not necessarily linear in an inter-frame CODEC: when the frame rate is low (because of frame skipping), there is likely to be a larger difference between successive frames and hence more data to code in the residual frame. Frame skipping may lead to a variable frame rate as the available resources change and this can be very distracting to the viewer. Frame skipping is widely used in software video CODECs.
Motion estimation (ME) search window: increasing or decreasing the ME search window changes the computational overhead of motion estimation. The relationship between search window size and computational complexity depends on the search algorithm. Table 10.1 compares the overhead of different search window sizes for the popular n-step search algorithm. With no search, only the (0, 0) position is matched; with a search window of +/- 1, a total of nine positions are matched; and so on.
Table 10.1  Computational overhead of the n-step search for different search window sizes

Search window    Number of comparison steps    Computation (normalised)
0 (no search)            1                            0.03
+/- 1                    9                            0.27
+/- 3                   17                            0.51
+/- 7                   25                            0.76
+/- 15                  33                            1.0
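The pattern in Table 10.1 follows from the structure of the n-step search: the first step examines the centre plus eight surrounding positions, and each further step adds eight more comparisons while roughly doubling the search window. The sketch below reproduces the table entries, normalising to the 33-comparison case.

/* Sketch: number of block comparisons for an n-step search. */
#include <stdio.h>

int main(void)
{
    int max_steps = 4;                        /* search window +/- 15       */
    int max_comparisons = 1 + 8 * max_steps;  /* 33, used for normalisation */

    printf("steps  window  comparisons  normalised\n");
    printf("  0       0         1         %.2f\n", 1.0 / max_comparisons);
    for (int n = 1, window = 1; n <= max_steps; n++, window = 2 * window + 1) {
        int comparisons = 1 + 8 * n;          /* 9, 17, 25, 33 */
        printf("  %d     +/-%2d       %2d         %.2f\n",
               n, window, comparisons, (double)comparisons / max_comparisons);
    }
    return 0;
}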
Figure: 2 x 2, 4 x 4 and 8 x 8 DCT block sizes
Input-dependent algorithms
An input-dependent VCA controls computational complexity depending on the characteristics of the video sequence or coded data. Examples include the following.

IDCT complexity reduction  In a typical inter-coded sequence, many blocks contain no non-zero coefficients after quantisation. Testing each block before computing the inverse DCT allows the transform to be skipped for these all-zero blocks:

if (all coefficients in the block are zero) {
    [skip the IDCT - the output block is all zero]
}
else {
    [calculate the IDCT...]
}

There is a small overhead associated with testing for zero; however, the computational saving can be very significant and there is no loss of quality. Further input-dependent complexity reductions can be applied to the IDCT.12
FDCT complexity reduction  Many blocks contain few non-zero coefficients after quantisation (particularly in inter-coded macroblocks). It is possible to predict the occurrence of some of these blocks before the FDCT is carried out, so that the FDCT and quantisation steps may be skipped, saving computation. The sum of absolute differences (SAD or SAE) calculated during motion estimation can act as a useful predictor for these blocks. SAD is proportional to the energy remaining in the block after motion compensation. If SAD is low, the energy in the residual block is low and it is likely that the block will contain little or no data after FDCT and quantisation.13 Figure 10.15 plots the probability that a block contains no coefficients after FDCT and quantisation, against SAD. This implies that it should be possible to skip the FDCT and quantisation steps for blocks with an SAD of less than a threshold value T:

if (SAD < T)
    [skip the FDCT and quantisation for this block]

If we set T = 200 then any block with SAD < 200 will not be coded. According to the figure, this prediction of zero coefficients will be correct 90% of the time. Occasionally (10% of the time in this case), the prediction will fail, i.e. a block will be skipped that should have been encoded. The reduction in complexity due to skipping FDCT and quantisation for some blocks is therefore offset by an increase in distortion due to incorrectly skipped blocks.
Figure 10.15  Probability of a zero block after FDCT and quantisation, plotted against SAD (sum of absolute differences)
Input-dependent motion estimation  A description has been given15 of a motion estimation algorithm with variable computational complexity. This is based on the nearest neighbours search (NNS) algorithm (described in Chapter 6), where motion search positions are examined in a series of layers until a minimum is detected. The NNS algorithm is extended to a VCA by adding a computational constraint on the number of layers that are examined at each iteration of the algorithm. As with the SAD prediction discussed above, this algorithm reduces computational complexity at the expense of increased coding distortion. Other computationally scalable algorithms for motion estimation are described elsewhere.16,17
10.3.3 Complexity-Rate Control
The VCAs described above are useful for controlling the computational complexity of video encoding and decoding. Some VCAs (such as zero testing in the IDCT) have no effect on image quality; however, the more flexible and powerful VCAs (such as zero DCT prediction) do have an effect on quality. These VCAs may also change the coded bit rate: for example, if a high proportion of DCT operations are skipped, fewer coded bits will be produced and the rate will tend to drop. Conversely, the target bit rate can affect computational complexity if VCAs are used. For example, a lower bit rate and higher quantiser scale will tend to produce fewer DCT coefficients and a higher proportion of zero blocks, reducing computational complexity.
Figure 10.16  Example of a rate-distortion-complexity surface
It is therefore not necessarily correct to treat complexity control and rate control as separate issues. An interesting recent development is the emergence of complexity-distortion theory. Traditionally, video CODECs have been judged by their rate-distortion performance as described in Section 10.2.2. With the introduction of VCAs, it becomes necessary to examine performance in three axes: complexity, rate and distortion. The operating point of a video CODEC is no longer restricted to a rate-distortion curve but instead lies on a rate-distortion-complexity surface, like the example shown in Figure 10.16. Each point on this surface represents a possible set of encoding parameters, leading to a particular set of values for coded bit rate, distortion and computational complexity.
Controlling rate involves moving the operating point along this surface in the rate-distortion plane; controlling complexity involves moving the operating point in the complexity-distortion plane. Because of the interrelationship between computational complexity and bit rate, it may be appropriate to control complexity and rate at the same time. This new area of complexity-rate control is at a very early stage and some preliminary results can be found elsewhere.14
10.4 SUMMARY
Many practical video CODECs have to operate in a rate-constrained environment. The problem of achieving the best possible rate-distortion performance is difficult to solve and optimum performance can only be obtained at the expense of prohibitively high computational cost. Practical rate control algorithms aim to achieve good, consistent video quality within the constraints of rate, delay and complexity. Recent developments in variable complexity coding algorithms enable a further trade-off between computational complexity and distortion and are likely to become important for CODECs with limited computational resources and/or power consumption.
Bit rate is one of a number of constraints that are imposed by the transmission or storage environment. Video CODECs are designed for use in communication systems and these constraints must be taken into account. In the next chapter we examine the key quality of service parameters required by a video CODEC and provided by transmission channels.
REFERENCES
1. A. Ortega and K. Ramchandran, Rate-distortion methods for image and video compression, IEEE Signal Processing Magazine, November 1998.
2. L-J. Lin and A. Ortega, Bit-rate control using piecewise approximated rate-distortion characteristics, IEEE Trans. CSVT, 8, August 1998.
3. Y. Yang, Rate control for video coding and transmission, Ph.D. Thesis, Cornell University, 2000.
4. M. Gallant and F. Kossentini, Efficient scalable DCT-based video coding at low bit rates, Proc. ICIP99, Japan, October 1999.
5. G. Sullivan and T. Wiegand, Rate-distortion optimization for video compression, IEEE Signal Processing Magazine, November 1998.
6. G. M. Schuster and A. Katsaggelos, A theory for the optimal bit allocation between displacement vector field and displaced frame difference, IEEE J. Selected Areas in Communications, 15(9), December 1997.
7. ISO/IEC JTC1/SC29 WG11 Document 93/457, MPEG-2 Video Test Model 5, Sydney, April 1993.
8. J. Ribas-Corbera and S. Lei, Rate control for low-delay video communications [H.263 TM8 rate control], ITU-T Q6/SG16 Document Q15-A-20, June 1997.
9. J. Ronda, M. Eckert, F. Jaureguizar and N. Garcia, Rate control and bit allocation for MPEG-4, IEEE Trans. CSVT, 9(8), December 1999.
10. C. Christopoulos, J. Bormans, J. Cornelis and A. N. Skodras, The vector-radix fast cosine transform: pruning and complexity analysis, Signal Processing, 43, 1995.
11. A. Hossen and U. Heute, Fast approximate DCT: basic idea, error analysis, applications, Proc. ICASSP97, Munich, April 1997.
12. K. Lengwehasatit and A. Ortega, DCT computation based on variable complexity fast approximations, Proc. ICIP98, Chicago, October 1998.
13. M-T. Sun and I-M. Pao, Statistical computation of discrete cosine transform in video encoders, J. Visual Communication and Image Representation, June 1998.
14. I. E. G. Richardson and Y. Zhao, Video CODEC complexity management, Proc. PCS01, Seoul, April 2001.
15. M. Gallant, G. Côté and F. Kossentini, An efficient computation-constrained block-based motion estimation algorithm for low bit rate video coding, IEEE Trans. Image Processing, 8(12), December 1999.
16. K. Lengwehasatit, A. Ortega, A. Basso and A. Reibman, A novel computationally scalable algorithm for motion estimation, Proc. VCIP98, San Jose, January 1998.
17. V. G. Moshnyaga, A new computationally adaptive formulation of block-matching motion estimation, IEEE Trans. CSVT, 11(1), January 2001.
18. V. K. Goyal and M. Vetterli, Computation-distortion characteristics of block transform coding, Proc. ICASSP97, Munich, April 1997.
11  Transmission of Coded Video
11.1 INTRODUCTION
A video communication system transmits coded video data across a channel or network and the transmission environment has a number of implications for the encoding and decoding of video. The capabilities and constraints of the channel or network vary considerably, for example from low bit rate, error-prone transmission over a mobile network to high bit rate, reliable transmission over a cable television network. Transmission constraints should be taken into account when designing or specifying video CODECs; the aim is not simply to achieve the best possible compression but to develop a video coding system that is well matched to the transmission environment.
This problem of matching the application to the network is often described as a quality of service (QoS) problem. There are two sides to the problem: the QoS required by the application (which relates to visual quality perceived by the user) and the QoS offered by the transmission channel or network (which depends on the capabilities of the network). In this chapter we examine QoS from these two points of view and discuss design approaches that help to match the offered and required QoS. We describe two examples of transmission scenarios and discuss how these scenarios influence video CODEC design.
11.2 QUALITY OF SERVICE REQUIREMENTS AND CONSTRAINTS

Data rate
A video encoder produces coded video at a variable or constant rate (as discussed in Chapter 10). The key parameters for transmission are the mean bit rate and the variation of the bit rate.
The mean rate (or the constant rate for CBR video) depends on the characteristics of the source video (frame size, number of bits per sample, frame rate, amount of motion, etc.) and on the compression algorithm. Practical video coding algorithms incorporate a degree of compression control (e.g. quantiser step size and mode selection) that allows some control of the mean rate after encoding. However, for a given source (with a particular frame size and frame rate) there is an upper and lower limit on the achievable mean compressed bit rate. For example, broadcast TV quality video (approximately 704 x 576 pixels per frame, 25 or 30 frames per second) encoded using MPEG-2 requires a mean encoded bit rate of around 2-5 Mbps for acceptable visual quality. In order to successfully transmit video at broadcast quality, the network or channel must support at least this bit rate.
Chapter 10 explained how the variation in coded bit rate depends on the video scene content and on the type of rate control algorithm used. Without rate control, a video CODEC tends to generate more encoded data when the scene contains a lot of spatial detail and movement and less data when the scene is relatively static. Different encoding modes (such as I-, P- or B-pictures in MPEG video) produce varying amounts of coded data. An output buffer together with a rate control algorithm may be used to map this variable rate to either a constant bit rate (no bit rate variation) or a variable bit rate with constraints on the maximum amount of variation.
Distortion
Most of the practical algorithms for encoding of real-time video are lossy, i.e. some distortion is introduced by encoding and the decoded video sequence is not identical to the original video sequence. The amount of distortion that is acceptable depends on the application.
Example 1
A movie is displayed on a large, high-quality screen at HDTV resolution. Capture and editing of the video material is of a very high quality and the viewing conditions are good. In this example, there is likely to be a low threshold for distortion introduced by the video CODEC, since any distortion will tend to be highlighted by the quality of the material and the viewing conditions.
Example 2
A small video window is displayed on a PC as part of a desktop video-conferencing application. The scene being displayed is poorly lit; the camera is cheap and placed at an inconvenient angle; the video is displayed at a low resolution alongside a number of other application windows. In this example, we might expect a relatively high threshold for distortion. Because of the many other factors limiting the quality of the visual image, distortion introduced by the video CODEC may not be obvious until it reaches significant levels.
Ideally, distortion due to coding should be negligible, i.e. the decoded video should be indistinguishable from the original (uncoded) video. More practical requirements for distortion may be summarised as follows:
1. Distortion should be acceptable for the application. As discussed above, the definition of acceptable varies depending on the transmission and viewing scenario: distortion due to coding should preferably not be the dominant limiting factor for video quality.
2. Distortion should be near constant from a subjective viewpoint. The viewer will quickly become used to a particular level of video quality. For example, analogue VHS video is relatively low quality but this has not limited the popularity of the medium because viewers accept a predictable level of distortion. However, sudden changes in quality (for example, blocking effects due to rapid motion or distortion due to transmission errors) are obvious to the viewer and can have a very negative effect on perceived quality.
Delay
By its nature, real-time video is sensitive to delay. The QoS requirements in terms of delay depend on whether video is transmitted one way (e.g. broadcast video, streaming video, playback from a storage device) or two ways (e.g. video conferencing).
Simplex (one-way) video transmission requires frames of video to be presented to the viewer at the correct points in time. Usually, this means a constant frame rate; in the case where a frame is not available at the decoder (for example, due to frame skipping at the encoder), the other frames should be delayed as appropriate so that the original temporal relationships between frames are preserved. Figure 11.1 shows an example: frame 3 from the original sequence is skipped by the encoder (because of rate constraints) and the frames arrive at the decoder in order 1, 2, 4, 5, 6. The decoder must hold frame 2 for two frame periods so that the later frames (4, 5, 6) are not displayed too early with respect to frames 1 and 2. In effect, the CODEC maintains a constant delay between capture and display of frames. Any accompanying media that is linked to the video frames must remain synchronised: the most common example is accompanying audio, where a loss of synchronisation of more than about 0.1 s can be obvious to the viewer.
Figure 11.1  Encoded, received and displayed frames: frame 3 is skipped at the encoder and the decoder holds frame 2 for two frame periods
Duplex (two-way) video transmission has the above requirements (constant delay in each direction, synchronisation between related media) plus the requirement that the total delay from capture to display must be kept low. A rule of thumb for video conferencing is to keep the total delay less than 0.4 s. If the delay is longer than this, normal conversation becomes difficult and artificial.
Interactive applications, in which the viewer's actions affect the encoded video material, also have a requirement of low delay. An example is remote VCR controls (play, stop, fast forward, etc.) for a streaming video source. A long delay between the user action (e.g. the fast forward button) and the corresponding effect on the video source may make the application feel unresponsive.
Figure 11.2 illustrates these three application scenarios.

Figure 11.2  One-way, two-way and interactive video transmission: capture, encoder, network, decoder and display (with a feedback path in the interactive case)
Data rate
Circuit-switched networks such as the PSTN/POTS provide a constant bit rate connection. Examples include 33.6 kbps for a two-way modem connection over an analogue PSTN line; 56 kbps downstream connection from an Internet service provider (ISP) over an analogue PSTN line; 128 kbps over basic rate ISDN.
Packet-switched networks such as Internet Protocol (IP) and Asynchronous Transfer Mode (ATM) provide a variable rate packet transmission service. This implies that these networks may be better suited to carrying coded video (with its inherently variable bit rate). However, the mean and peak packet transmission rate depend on the capacity of the network and may vary depending on the amount of other traffic in the network.
The data rate of a digital subscriber line connection (e.g. Asymmetric Digital Subscriber Line, ADSL) can vary depending on the quality of the line from the subscriber to the local PSTN exchange (the local loop). The end-to-end rate achieved over this type of connection may depend on the core network (typically IP) rather than the local ADSL connection speed.
Dedicated transmission services such as satellite broadcast, terrestrial broadcast and cable TV provide a constant bit rate connection that is matched to the QoS requirements of encoded television channels.
Errors
The circuit-switched PSTN and dedicated broadcast channels have a low rate of bit errors (randomly distributed, independent errors, case (a) in Figure 11.3).

Figure 11.3  Errors in a transmitted bit sequence: (a) random bit errors, (b) packet loss, (c) burst of bit errors

Packet-switched
networks such as IP usually have a low bit error rate but can suffer from packet loss during periods of network congestion (loss of the data payload of a complete network packet, case (b) in Figure 11.3). Packet loss is often bursty, i.e. a high rate of packet loss may be experienced during a particular period followed by a much lower rate of loss. Wireless networks (such as wireless LANs and personal communications networks) may experience high bit error rates due to poor propagation conditions. Fading of the transmitted signal can lead to bursts of bit errors in this type of network (case (c) in Figure 11.3, a sequence of bits containing a significant number of bit errors). Figure 11.4 shows the path loss (i.e. the variation in received signal power) between a base station and receiver in a mobile network, plotted as a function of distance. A mobile receiver can experience rapid fluctuations in signal strength (and hence in error rates) due to fading effects (such as the variation with distance shown in the figure).
Delay
Circuit-switched networks and dedicated broadcast channels provide a near-constant, predictable delay. Delay through a point-to-point wireless connection is usually predictable. The delay through a packet-switched network may be highly variable, depending on the route taken by the packet and the amount of other traffic. The delay through a network node, for example, increases if the traffic arrival rate is higher than the processing rate of the node. Figure 11.5 shows how two packets may experience very different delays as they traverse a packet-switched network. In this example, a packet following route A passes through four routers and experiences long queuing delays whereas a packet following route B passes through two routers with very little queuing time. (Some improvement may be gained by adopting virtual circuit switching where successive packets from the same source follow identical routes.) Finally, automatic repeat request (ARQ)-based error control can lead to very variable delays whilst waiting for packet retransmission, and so ARQ is not generally appropriate for real-time video transmission (except in certain special cases for error handling, described later).
Figure 11.4  Path loss (signal strength against distance) in a mobile network

Figure 11.5  Two routes through a packet-switched network: route A passes through more routers and longer input queues than route B
Data rate
Most transmission scenarios require some form of rate control to adapt the inherently variable rate produced by a video encoder to a fixed or constrained bit rate supported by a network or channel. A rate control mechanism generally consists of an output buffer and a feedback control algorithm; practical rate control algorithms are described in Chapter 10.
Errors
A bit error in a compressed video sequence can cause a cascade of effects that may lead to severe picture degradation. The following example illustrates the potential effects of a single bit error:
1. A single bit error occurs within variable-length coded transform coefficient data.
2. The decoder loses synchronisation with the variable-length coded data and cannot regain synchronisation with the correct sequence of syntax elements. The decoder can always recover at the next resynchronisation marker (such as a slice start code [MPEG-2] or GOB header [MPEG-4/H.263]). However, a whole section of the current frame may be corrupted before resynchronisation occurs. This effect is known as spatial error propagation, where a single error can cause a large spatial area of the frame to be distorted.
Figure 11.6 shows an example: a single bit error affects a macroblock in the second-last row in this picture (coded using MPEG-4). Subsequent macroblocks are incorrectly decoded and the errored region propagates until the end of the row of macroblocks (where a GOB header enables the decoder to resynchronise).
Figure 11.7  Example of temporal error propagation

Figure 11.8  Increase in errored area during temporal propagation
The frames shown in Figure 11.7 were predicted from the errored frame shown in Figure 11.6: no further errors have occurred, but the distorted area continues to appear in further predicted frames. Because the macroblocks of the frames in Figure 11.7 are predicted using motion compensation, the errored region changes shape. The corrupted area may actually increase in subsequent predicted frames, as illustrated in Figure 11.8: in this example, motion vectors for macroblocks in frame 2 point towards an errored area in frame 1 and so the error spreads out in frame 2. Over a long sequence of predicted frames, an errored region will tend to spread out and also to fade as it is added to by successive correctly decoded residual frames.
In practice, packet losses are more likely to occur than
bit errors in many situations. For
example, a network transport protocol may discard packets containing bit errors. When a
packet is lost, an entire section of coded data is discarded. A large section of at least one
frame will be lost and this area may be increased due to spatial and temporal propagation.
Delay
Any delay within the video encoder and decoder must not cause the total delay to exceed the limits imposed by the application (e.g. a total delay of 0.4 s for video conferencing). Figure 11.9 shows the main sources of delay within a video coding application: each of the components shown (from the capture buffer through to the display buffer) introduces a delay. (Note that any multiplexing/packetising delay is assumed to be included in the encoder output buffer and decoder input buffer.)
Figure 11.9  Sources of delay in a video communication system: capture buffer, encoder, output buffer, channel, input buffer, decoder, display buffer
Encoder
Resynchronisation methods may be used to limit error propagation. These include restart markers (e.g. slice start code, GOB header) to limit spatial error propagation, intra-coded pictures (e.g. MPEG I-pictures) to limit temporal error propagation and intra-coded macroblocks to force an error-free update of a region of the picture. Splitting an encoded frame into sections that may be decoded independently limits the potential for error propagation, and H.263+ Annex R (independent segment decoding) and the video packet mode of MPEG-4 support this. A further enhancement of error resilience is provided by the optional reversible variable length codes (RVLCs) supported by MPEG-4 and described in Chapter 8.
Layered or scalable coding (such as the four scalable modes of MPEG-2) can improve performance in the presence of errors. The base layer in a scalable coding algorithm is usually more sensitive to errors than the enhancement layer(s), and some improvement in error resilience has been demonstrated using unequal error protection, i.e. applying increased error protection to the base layer.3
Channel
Suitable techniques include the use of error control coding5,6 and intelligent mapping of coded data into packets. The error control code specified in H.261 and H.263 (a BCH code) cannot correct many errors. More robust coding may be more appropriate (for example, concatenated Reed-Solomon and convolutional coding for MPEG-2 terrestrial or satellite transmission, see Section 11.4.1). Increased protection from errors can be provided at the expense of higher error correction overhead in transmitted packets. Careful mapping of encoded data into network packets can minimise the impact of a lost packet. For example, placing an independently decodeable unit (such as an independent segment or video packet) into each transmitted packet means that a lost packet will affect the smallest possible area of a decoded frame (i.e. the error will not propagate spatially beyond the data contained within the packet).
Decoder
A transmission error may cause a violation of the coded data syntax that is expected at the decoder. This violation indicates the approximate location of the corresponding errored region in the decoded frame. Once this is known, the decoder may implement error concealment to minimise the visual impact of the error. The extent of the errored region can be estimated once the position of the error is known, as the error will usually propagate spatially up to the next resynchronisation point (e.g. GOB header or slice start code). The decoder attempts to conceal the errored region by making use of spatially and temporally adjacent error-free regions. A number of error concealment algorithms exist and these usually take advantage of the fact that a human viewer is more sensitive to low-frequency components in the decoded image. An error concealment algorithm attempts to restore the low-frequency information and (in some cases) selected high-frequency information.
Spatial error concealment repairs the damaged region by interpolation from neighbouring error-free pixels. Errors typically affect a stripe of macroblocks across a picture (see for example Figure 11.6) and so the best method of interpolation is to use pixels immediately above and below the damaged area, as shown in Figure 11.10. A spatial filter may be used to smooth the boundaries of the repaired area. More advanced concealment algorithms attempt to maintain significant features such as edges across the damaged region. This usually requires a computationally complex algorithm, for example using projection onto convex sets (see Section 9.3, Post-filtering).

Figure 11.10  Spatial error concealment: the errored macroblocks are interpolated from pixels above and below the damaged area
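A minimal sketch of this interpolation is shown below, assuming a luminance frame stored as an array and a damaged stripe one macroblock (16 lines) high; the linear weighting is an illustrative choice.

/* Sketch: spatial concealment of a damaged 16-line macroblock stripe by
   linear interpolation from the error-free rows above and below it. */
#include <stdio.h>

#define WIDTH  176   /* QCIF luminance width  */
#define HEIGHT 144   /* QCIF luminance height */

static unsigned char frame[HEIGHT][WIDTH];

static void conceal_stripe(int top_row)      /* first damaged row; stripe is 16 rows */
{
    int above = top_row - 1;                 /* last error-free row above the stripe */
    int below = top_row + 16;                /* first error-free row below the stripe */
    if (above < 0 || below >= HEIGHT) return;

    for (int y = top_row; y < top_row + 16; y++) {
        double w = (double)(y - above) / (below - above);   /* 0 near top, 1 near bottom */
        for (int x = 0; x < WIDTH; x++)
            frame[y][x] = (unsigned char)((1.0 - w) * frame[above][x]
                                          + w * frame[below][x] + 0.5);
    }
}

int main(void)
{
    /* fill the frame with a simple vertical gradient, then conceal a 'damaged' stripe */
    for (int y = 0; y < HEIGHT; y++)
        for (int x = 0; x < WIDTH; x++)
            frame[y][x] = (unsigned char)y;
    conceal_stripe(64);                      /* conceal rows 64..79 */
    printf("concealed pixel (70, 0) = %d\n", frame[70][0]);
    return 0;
}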
Temporal error concealment copies data from temporally adjacent error-free frames to hide the damaged area. A simple approach is to copy the same region from the previous frame (often available in the frame store memory at the decoder). A problem occurs when there is a change between the frames due to motion: the copied area appears to be offset and this can be visually disturbing. This effect can be reduced by compensating for motion and this is straightforward if motion vectors are available for the damaged macroblocks. However, in many cases the motion vectors may be damaged themselves and must be reconstructed, for example by interpolating from the motion vectors of undamaged macroblocks. Good results may be obtained by re-estimating the motion vectors in the decoder, but this adds significantly to the computational complexity.
Figure 11.11 shows how the error-resilient techniques described above may be applied to the encoder, channel and decoder.

Figure 11.11  Error-resilient techniques: synchronisation markers, intra-coding and reversible VLCs at the encoder; error control coding, unequal error protection and packetisation in the channel; error detection, spatial and temporal concealment at the decoder

Combined approach

Recently, some promising methods for error handling involving cooperation between several stages of the transmission chain have been proposed.9 In a real-time video
communication system it is not usually possible to retransmit damaged or lost packets due to delay constraints; however, it is possible for the decoder to signal the location of a lost packet to the encoder. The encoder can then determine the area of the frame that is likely to be affected by the error (a larger area than the original errored region, due to motion-compensated prediction from the errored region) and encode macroblocks in this area using intra-coding. This will have the effect of cleaning up the errored region once the feedback message is received. Alternatively, the technique of reference picture selection enables the encoder (and decoder) to choose an older, error-free frame for prediction of the next inter-frame once the position of the error is known. This requires both encoder and decoder to store multiple reference frames. The reference picture selection modes of H.263 (Annex N and Annex U) may be used for this purpose.
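The encoder-side decision can be sketched as follows; the message format and the simple choice between reference picture selection and intra-refresh are illustrative assumptions rather than part of any standard.

/* Sketch: encoder reaction to a decoder error report.
   If an older error-free reference frame is available, switch prediction
   to it (reference picture selection); otherwise intra-code the macroblocks
   likely to be affected by the reported error. */
#include <stdio.h>

typedef struct {
    int errored_frame;      /* frame number reported by the decoder */
    int first_errored_mb;   /* first damaged macroblock             */
    int last_errored_mb;    /* last damaged macroblock              */
} ErrorReport;

static void handle_error_report(const ErrorReport *rep,
                                int oldest_stored_reference,
                                int spread_mbs)   /* allowance for motion spread */
{
    if (oldest_stored_reference < rep->errored_frame) {
        printf("predict next frame from error-free reference frame %d\n",
               oldest_stored_reference);
    } else {
        int first = rep->first_errored_mb - spread_mbs;
        int last  = rep->last_errored_mb + spread_mbs;
        if (first < 0) first = 0;
        printf("intra-code macroblocks %d to %d in the next frame\n", first, last);
    }
}

int main(void)
{
    ErrorReport rep = { 1, 77, 87 };     /* error in frame 1, MBs 77-87 (illustrative) */
    handle_error_report(&rep, 0, 11);    /* frame 0 still stored: use it               */
    handle_error_report(&rep, 3, 11);    /* no older reference: intra-refresh instead  */
    return 0;
}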
These two methods of incorporating feedback are illustrated in Figures 11.12 and 11.13. In Figure 11.12, an error occurs during transmission of frame 1. The decoder signals the estimated location of the error to the encoder; meanwhile, the error propagates to frames 2 and 3 and spreads out due to motion compensation. The encoder estimates the likely spread of the damaged area and intra-codes an appropriate region of frame 4. The intra-coded macroblocks halt the temporal error propagation and clean up decoded frames 4 and onwards. In Figure 11.13, an error occurs in frame 1 and the error is signalled back to the encoder by the decoder. On receiving the notification, the encoder selects a known good reference frame (frame 0 in this case) to predict the next frame (frame 4). Frame 4 is inter-coded by motion-compensated prediction from frame 0 at the encoder. The decoder also selects frame 0 for reconstructing frame 4 and the result is an error-free frame 4.
Figure 11.12  Feedback and intra-coded update: initial error, temporal propagation, error cleaned up by intra-coding

Figure 11.13  Reference picture selection: the encoder predicts from an older, error-free reference frame

11.3.3 Delay

The components shown in Figure 11.9 can each add to the delay (latency) through the video communication system:
Capture buffer: this should only add delay if the encoder stalls, i.e. it takes too long to encode incoming frames. This may occur in a software video encoder when insufficient processing capacity is available.
Encoder: I- and P-frames do not introduce a significant delay; however, B-picture coding requires a multiple frame delay (as discussed in Chapter 4) and so the use of B-pictures should be limited in a delay-sensitive application.
Output buffer: the output buffer adds a delay that depends on its maximum size (in bits). For example, if the channel bit rate is 64 kbps, a buffer of 32 kbits adds a delay of 0.5 s. Keeping the buffer small minimises delay, but makes it difficult to maintain consistent visual quality (as discussed in Chapter 10).
Input buffer: the decoder input buffer size should be set to match the encoder output buffer. If the encoder and decoder are processing video at the same rate (i.e. the same number of frames per second), the input buffer delay mirrors that of the output buffer.
Decoder: the use of B-pictures adds at most one frame's delay at the decoder and so this is not such a critical issue as at the encoder.
Display buffer: as with the capture buffer, the display buffer should not add a significant delay unless a queue of decoded frames is allowed to build up due to variable decoding speed. In this case, the decoder should pause until the correct time for decoding a frame.
H.320,14 H.324,15 H.32316
MPEG-2 Systems
MPEG-2 (Systems) describes two methods of multiplexing audio, video and associated information: the program stream and the transport stream. In each case, streams of coded audio, video, data and system information are first packetised to form packetised elementary stream (PES) packets.
Figure 11.14  MPEG-2 transport stream multiplexing: PES packets (video, audio, system and other data), together with PES packets from other programs, are multiplexed into TS packets, modulated and transmitted
Figure 11.15  Demultiplexing and decoding an MPEG-2 transport stream: demodulation, convolutional and Reed-Solomon decoding, demultiplexing into PES packets, video and audio decoding with clock recovery
Two levels of error correcting coding provide protection from transmission errors. First, 16 parity bytes are added to each 188-byte TS packet to form a 204-byte Reed-Solomon codeword, and the stream of codewords is further protected with a convolutional code (usually a 7/8 code, i.e. the encoder produces 8 output bits for every 7 input bits). The total error coding overhead is approximately 25%. The inner convolutional code can correct random bit errors and the outer Reed-Solomon code can correct burst errors of up to 64 bits in length.
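The quoted overhead can be checked directly by combining the Reed-Solomon expansion (204/188) with the rate-7/8 convolutional code, as in the short sketch below.

/* Sketch: total channel-coding overhead for RS(204,188) plus a rate-7/8
   convolutional code. */
#include <stdio.h>

int main(void)
{
    double rs_expansion   = 204.0 / 188.0;   /* 16 parity bytes per 188-byte packet  */
    double conv_expansion = 8.0 / 7.0;       /* 8 output bits for every 7 input bits */

    double total = rs_expansion * conv_expansion;
    printf("coded bits per payload bit: %.3f (overhead %.0f%%)\n",
           total, (total - 1.0) * 100.0);    /* approximately 24% */
    return 0;
}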
Figure 11.15 illustrates the process of demultiplexing and decoding an MPEG-2 TS. After correcting transmission errors, the stream of TS packets is demultiplexed and PES packets corresponding to a particular program are buffered and decoded. The decoder STC is periodically updated when an SCR field is received and the STC provides a timing reference for the video and audio decoders.
stereo audio
video coded to approximately 3-5 Mbps

An MPEG-2 video encoder/decoder design aims to provide high-quality video within these transmission parameters. The channel coding (Reed-Solomon and convolutional ECC) is designed to correct most transmission errors, and error-resilient video coding is generally limited to simple error recovery (and perhaps concealment) at the decoder to handle the occasional uncorrected error. The STC and the use of timestamps in each PES packet provide accurate synchronisation.
H.323 components
Terminal  This is the basic entity in an H.323-compliant system. An H.323 terminal consists of a set of protocols and functions and its architecture is shown in Figure 11.16. The mandatory requirements for an H.323 terminal (highlighted in the figure) are audio coding (using the G.711, G.723 or G.729 audio coding standards) and three control protocols: H.245 (channel control), Q.931 (call set-up and signalling) and registration/admission/status (RAS) (used to communicate with a gatekeeper, see below). Optional capabilities include video coding (H.261, H.263), data communications (using T.120) and the real-time protocol (RTP) for packet transmission over IP networks. All H.323 terminals support point-to-point conferencing (i.e. one-to-one communications); support for multi-point conferencing (three or more participants) is optional.
[Figure 11.16 residue omitted: H.323 terminal architecture, comprising video CODEC (H.261/H.263), audio CODEC (G.711/G.723/G.729), data interface (T.120), system control (H.245, Q.931, RAS), RTP and LAN interface; required components highlighted.]
Figure 11.17 H.323 multi-point conferences
[Figure 11.18 residue omitted: packets received out of order are re-ordered using RTP sequence numbers; the position of lost packet 4 is flagged.]
RTP may be used on top of UDP for transmission of coded video and audio. RTP adds
time stamps and sequence numbers to UDP packets, enabling
a decoder to identify lost,
delayed or out-of-sequence packets. If possible, a receiver will reorder the packets prior to
decoding; if a packet does not arrive in time, its position is signalled to the decoder so that
error recovery can be carried out. Figure 11.18 illustrates the way in which RTP reorders packets and signals the presence of lost packets. Packet 4 from the original sequence is lost during transmission and the remaining packets are received out of order. Sequence numbering and time stamps enable the packets to be reordered and indicate the absence of packet 4.
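A receiver might implement this reordering with a small buffer indexed by sequence number. The sketch below is a simplified illustration only (fixed-size window, no sequence-number wrap-around, no timestamps); the structure and function names are hypothetical and are not taken from any RTP library.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WINDOW 8   /* number of packets held for reordering */

struct packet {
    int      present;   /* has this slot been filled? */
    uint16_t seq;       /* RTP-style sequence number */
    /* ... payload, timestamp, etc. ... */
};

/* Place an arriving packet into its slot, indexed by sequence number. */
static void receive_packet(struct packet *buf, uint16_t first_seq, uint16_t seq)
{
    int slot = seq - first_seq;
    if (slot >= 0 && slot < WINDOW) {
        buf[slot].present = 1;
        buf[slot].seq = seq;
    }
}

int main(void)
{
    struct packet buf[WINDOW];
    uint16_t first_seq = 1;
    memset(buf, 0, sizeof(buf));

    /* Packets arrive out of order and packet 4 is lost (cf. Figure 11.18). */
    uint16_t arrival_order[] = { 1, 3, 2, 6, 5 };
    for (int i = 0; i < 5; i++)
        receive_packet(buf, first_seq, arrival_order[i]);

    /* Pass packets to the decoder in sequence order, flagging any gaps. */
    for (int slot = 0; slot < 6; slot++) {
        if (buf[slot].present)
            printf("decode packet %u\n", (unsigned)buf[slot].seq);
        else
            printf("packet %u missing - signal loss to decoder\n",
                   (unsigned)(first_seq + slot));
    }
    return 0;
}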
The real-time control protocol (RTCP) may be used to monitor and control an RTP session. RTCP sends quality control messages to each participant in the session containing
useful QoS information such as the packet loss rate.
The resource reservation protocol (RSVP) enables terminals to request
a guaranteed
transmission bandwidth for the duration of the communication session. This improves the
available QoS for real-time video and audio communications but requires support from every
switch or router in the section of network traversed by the session.
11.5 SUMMARY
Successful video communications relies upon matching the QoS required by an application
with the QoS provided by the transmission network. In this chapter we discussed key QoS
parameters from the point of view of the video CODEC and the network. Removing subjective and statistical redundancy through the video compression process has the disadvantage that the compressed data becomes sensitive to transmission impairments such as delays and errors. An effective solution to the QoS problem is to deal with it both in the video CODEC (for example by introducing error-resilient features and matching the rate control algorithm to the channel) and in the network (for example by adopting protocols such as RTP). We described two popular transmission scenarios, digital television broadcast and IP video conferencing, and their influence on video CODEC design. The result of taking the transmission environment into account is a distinctly different CODEC in each case.
Video CODEC design is also heavily influenced by the implementation platform and in
the next chapter we discuss alternative platforms and their implications for the designer.
REFERENCES
1. Y. Wang, S. Wenger, J. Wen and A. Katsaggelos, Review of error resilient coding techniques for real-time video communications, IEEE Signal Processing Magazine, July 2000.
2. B. Girod and N. Farber, Error-resilient standard-compliant video coding, from Recovery Techniques for Image and Video Compression, Kluwer Academic Publishers, 1998.
3. I. E. G. Richardson, Video coding for reliable communications, Ph.D. thesis, Robert Gordon University, 1999.
4. J. Y. Liao and J. Villasenor, Adaptive intra block update for robust transmission of H.263, IEEE Trans. CSVT, February 2000.
5. W. Kumwilaisak, J. Kim and C. Jay Kuo, Video transmission over wireless fading channels with adaptive FEC, Proc. PCS01, Seoul, April 2001.
6. M. Bystrom and J. Modestino, Recent advances in joint source-channel coding of video, Proc. URSI Symposium on Signals, Systems and Electronics, Pisa, Italy, 1998.
7. S. Tsekeridou and I. Pitas, MPEG-2 error concealment based on block-matching principles, IEEE Trans. Circuits and Systems for Video Technology, June 2000.
8. J. Zhang, J. F. Arnold and M. Frater, A cell-loss concealment technique for MPEG-2 coded video, IEEE Trans. CSVT, June 2000.
9. B. Girod and N. Farber, Feedback-based error control for mobile video transmission, Proc. IEEE (special issue on video for mobile multimedia), 1999.
10. P-C. Chang and T-H. Lee, Precise and fast error tracking for error-resilient transmission of H.263 video, IEEE Trans. Circuits and Systems for Video Technology, June 2000.
11. N. Farber, B. Girod and J. Villasenor, Extensions of the ITU-T recommendation H.324 for error-resilient video transmission, IEEE Communications Magazine, June 1998.
12. Y. Yang, Rate control for video coding and transmission, Ph.D. thesis, Cornell University, 2000.
13. G. J. Conklin et al., Video coding for streaming media delivery on the internet, IEEE Trans. CSVT, 11(3), March 2001.
14. ITU-T Recommendation H.320, Line transmission of non-telephone signals, 1992.
15. ITU-T Recommendation H.324, Terminal for low bitrate multimedia communication, 1995.
16. ITU-T Recommendation H.323, Packet based multimedia communications systems, 1997.
17. ISO/IEC 13818-1, Information technology: generic coding of moving pictures and associated audio information: Systems, 1995.
Platforms
12.1 INTRODUCTION
In the early days of video coding technology, systems tended to fall into two categories: dedicated hardware designs for real-time video coding (e.g. H.261 videophones) or software designs for off-line (not real-time) image or video coding (e.g. JPEG compression/decompression software). The continued increases in processor performance, memory density and storage capacity have led to a blurring of these distinctions and video coding applications are now implemented on a wide range of processing platforms. General-purpose platforms such as personal computer (PC) processors can achieve respectable real-time coding performance and benefit from economies of scale (i.e. widespread availability, good development tools, competitive cost). There is still a need for dedicated hardware architectures in certain niche applications, such as high-end video encoding or very low power systems. The middle ground between the general-purpose platform and the dedicated hardware design (for applications that require more processing power than a general-purpose processor can provide but where a dedicated design is not feasible) was, until recently, occupied by programmable video processors. So-called media processors, providing support for wider functionalities such as audio and communications processing, are beginning to occupy this market. There is currently a convergence of processing platforms, with media extensions and features being added to traditionally distinct processor families (embedded, DSP, general-purpose) so that the choice of platform for video CODEC designs is wider than ever before.
In this chapter we attempt to categorise the main platform alternatives and to compare
their advantages and disadvantages for the designer of a video coding system. Of course,
some of the information in this chapter will be out of date by the time this book is published,
due to the rapid pace of development in processing platforms.
[Table 12.1 residue omitted: PC processors, giving manufacturer, latest offering and features for Intel (Pentium), Motorola (PowerPC G4) and AMD (Athlon).]
12.2.1 Capabilities
PC processors can be loosely characterised as follows:
• ... versions of the above processors are ...;
• support word lengths of 32 bits or more, fixed and floating point arithmetic;
• support for SIMD instructions (for example carrying out the same operation on 4 x 32-bit words).
The popular PC operating systems (Windows and Mac O/S) support multi-tasking applications and offer good support for external hardware (via plug-in cards or interfaces such as
USB).
12.2.2 Multimedia Support

Recent trends towards multimedia applications have led to increasing support for real-time media. There are several frameworks that may be used within the Windows O/S, for example, to assist in the rapid development and deployment of real-time applications. The DirectX and Windows Media frameworks provide standardised interfaces and tools to support efficient capture, processing, streaming and display of video and audio.

The increasing usage of multimedia has driven processor manufacturers to add architectural and instruction support for typical multimedia processing operations. The three processor families listed in Table 12.1 (Pentium, PowerPC, Athlon) each support a version of single instruction, multiple data (SIMD) processing. Intel's MMX and SIMD extensions1,2 provide a number of instructions aimed at media processing. A SIMD instruction operates on multiple data elements simultaneously (e.g. multiple 16-bit words within a 64-bit or 128-bit register). This facilitates computationally intensive, repetitive operations such
as motion estimation (e.g. calculating sum of absolute differences, SAD) and DCT (e.g. multiply-accumulate operations). Figure 12.1 shows how the Intel instruction psadbw may be used to calculate SAD for eight pairs of input samples (Ai, Bi) in parallel, leading to a potentially large computational saving.
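In scalar C the same calculation might be written as below; an instruction such as psadbw (available through compiler intrinsics on processors that support it) performs all eight absolute differences and the summation in a single operation. The function name here is illustrative.

#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over eight sample pairs (Ai, Bi).
   A SIMD instruction such as psadbw computes this in one step;
   the scalar loop below shows the equivalent arithmetic. */
static unsigned sad8(const uint8_t *a, const uint8_t *b)
{
    unsigned sad = 0;
    for (int i = 0; i < 8; i++)
        sad += (unsigned)abs((int)a[i] - (int)b[i]);
    return sad;
}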
Table 12.2 summarises the main advantages and disadvantages of PC platforms for video coding applications. The large user base and comprehensive development support make it an attractive platform for applications such as desktop video conferencing (Figure 12.2) in which a video CODEC is combined with a number of other components such as audio CODEC, chat and document sharing to provide a flexible, low-cost video communication system.
Figure 12.2 PC-based video conferencing (camera, microphone, remote video and shared workspace)
12.3 DIGITAL SIGNAL PROCESSORS

• low/medium cost;
• on-chip and off-chip code and data storage (depending on the available address space; on-chip space is limited);
• 16- or 32-bit word length.
Table 12.3 lists a few popular DSPs and compares their features: this is only a small selection of the wide range of DSPs on the market. As well as these discrete ICs, a number of manufacturers provide DSP cores (hardware architectures designed to be integrated with other modules on a single IC).
Table 12.3 Popular DSPs

Manufacturer         Device
Texas Instruments    C5000 series; C6000 series
Analog Devices       ADSP-218x and 219x series; SHARC
Motorola             DSP563xx; DSP568xx
A key feature of DSPs is the ability to efficiently carry out repetitive processing
algorithms such as filtering and transformation. This means that they are well suited to
many of the computationally intensive functions required of a typical DCT-based video CODEC, such as motion estimation, DCT and quantisation, and some promising performance results have been reported.3,4 Because a DSP is specifically designed for this type of application, this performance usually comes without the penalty of high power consumption. Support for related video processing functions (such as video capture, transmission and rendering) is likely to be limited. The choice of application development tools is not as wide
as for the PC platform and high-level language support is often limited to the C language.
Table 12.4 summarises the advantages and disadvantages of the DSP platform for video
coding applications.
In a typical DSP development scenario, code is developed on a host PC in C, cross-compiled and downloaded to a development board for testing. The development board
consists of a DSP IC together with peripherals such as memory, A/D converters and other
interfaces. To summarise, a DSP platform can provide good performance with low power
consumption but operating system and development support is often limited. DSPs may be a
suitable platform for low-power, special-purpose applications (e.g. a hand-held videophone).
[Table 12.4 residue omitted: advantages and disadvantages of the DSP platform.]
12.4 EMBEDDED PROCESSORS
The term embedded processor usually refers to a processor or controller that is embedded into a larger system, in order to provide programmable control and perhaps
processing
capabilities alongside more specialist, dedicated devices. Embedded processors are widely
used in communications (mobile devices, network devices, etc) and control
applications (e.g.
automotive control). Typical characteristics are:
• low power consumption;
• low cost;
• limited word lengths;
• fixed-point arithmetic.
Until recently, an embedded processor would not have been considered suitable for video coding applications because of severely limited processing capabilities. However, in common with other types of processor, the processing power of new generations of embedded processor continues to increase. Table 12.5 summarises the features of some popular embedded processors.

The popular ARM and MIPS processors are licensed as cores for integration into third-party systems. ARM is actively targeting low-power video coding applications, demonstrating 15 frames per second H.263 encoding and decoding (QCIF resolution) on an ARM9 and developing co-processor hardware to further improve video coding performance.5
Table 12.6 summarises the advantages and disadvantages of embedded processors for video coding applications. Embedded processors are of interest because of their large market penetration (for example, in the high-volume mobile telephone market). Running low-complexity video coding functions in software on an embedded processor (perhaps with limited dedicated hardware assistance) may be a cost-effective way of bringing video applications to mobile and wireless platforms. For example, the hand-held videophone is seen as a key application for the emerging 3G high bit-rate mobile networks. Video coding on low-power embedded or DSP processors may be a key enabling technology for this type of device.
Table 12.5 Popular embedded processors

Manufacturer    Device
MIPS            4K series
ARM             ARM9 series
ARM/Intel       StrongARM series
Table 12.6 Embedded processors for video coding

Disadvantages
Limited performance
Limited word lengths, arithmetic, address spaces
(As yet) few features to support video processing
12.5 MEDIA PROCESSORS
DSPs have certain advantages over general-purpose processors for video coding applications; so-called media processors go a step further by providing dedicated hardware functions that support video and audio compression and processing. The general concept of a media processor is a core processor together with a number of dedicated co-processors that carry out application-specific functions.

The architecture of the Philips TriMedia platform is shown in Figure 12.3. The core of the TriMedia architecture is a very long instruction word (VLIW) processor. A VLIW processor can carry out operations on multiple data words (typically four 32-bit words in the case of the TriMedia) at the same time. This is a similar concept to the SIMD instructions described earlier (see for example Figure 12.1) and is useful for video and audio coding applications. Computationally intensive functions in a video CODEC such as motion estimation and DCT may be efficiently carried out using VLIW instructions.
[Figure 12.3 residue omitted: TriMedia architecture, showing a VLIW core with co-processors (video and audio I/O, image co-processor, variable-length decoder), timers, memory interface and system bus.]
The co-processors in the TriMedia architecture are designed to reduce the computational burden on the core by carrying out intensive operations in hardware. Available co-processor units, shown in Figure 12.3, include video and audio interfaces, memory and external bus interfaces, an image co-processor and a variable-length decoder (VLD). The image co-processor is useful for pre- and post-processing operations such as scaling and filtering, and the VLD can decode an MPEG-2 stream in hardware (but does not currently support other coding standards). With careful software design and optimisation, a video coding application running on the TriMedia can offer good performance at a modest clock speed whilst retaining some of the benefits of a general-purpose processor (including the ability to program the core processor in C or C++ software).6
The MAP processor developed by Equator and Hitachi is another media processor that has generated interest recently. The heart of the processor is a VLIW core, surrounded by peripheral units that deal with video I/O, communications, video filtering and variable-length coding. According to the manufacturer, the MAP-CA can achieve impressive performance for video coding applications, for example encoding MPEG-2 Main Profile/Main Level video at 30 frames per second using 63% of the available processing resources.7 This is higher than the reported performance of similar applications on the TriMedia.
Media processors have yet to capture a significant part of the market, and it is not yet clear whether the halfway house between dedicated hardware and general-purpose software platforms will be a market winner. Table 12.7 summarises their main advantages and disadvantages for video coding.
with lower bit-rate data such as compressed video, graphics and also coded audio. The DRAM bus is connected to a dedicated video processor (the VP6), a dynamic RAM interface and video input and output ports. The DRAM side deals with high bit-rate, uncompressed video and with most of the computationally intensive video coding operations. Variable-length encoding and decoding are handled by dedicated VLE and VLD modules. This partitioned architecture enables the VCPex to achieve good video coding and decoding performance with relatively low power consumption. Computationally intensive video coding functions (and pre- and post-processing) are handled by dedicated modules, but at the same time the MSC and VP6 processors may be reprogrammed to support a range of coding standards.
Table 12.8 summarises the advantages and disadvantages of this type of processor. Video signal processors do not appear to be a strong force in the video communications market,
Table 12.8 Advantages and disadvantages of video signal processors

Advantages                        Disadvantages
Application-specific features     Limited programmability
[Figure residue omitted: dedicated video/JPEG CODEC chip architecture, showing video interfaces, JPEG CODEC, controller, data interfaces and host interface.]
[Figure residue omitted: MPEG-4 video CODEC chip, showing video in/out, filtering, bit-stream multiplexing, network interface and host interface.]
capture, display and filtering are handled by co-processing modules. The IC is aimed at low-power, low bit-rate video applications and can handle QCIF video coding and decoding at 15 frames per second with a power consumption of 240 mW. A reduced-functionality version of this chip, the TC35274, handles only MPEG-4 video decoding.
Table 12.9 summarises the advantages and disadvantages of dedicated hardware designs.
This type of CODEC is becoming widespread for mass market applications such as digital
television receivers and DVD players. One potential disadvantage is the reliance on a single manufacturer in a specialist market; this is perhaps less likely to be a problem with general-purpose processors and media processors as they are targeted at a wider market.
12.8 CO-PROCESSORS
A co-processor is a separate unit that is designed to work with a host processor (such as a general-purpose PC processor). The co-processor (or accelerator) carries out certain computationally intensive functions in hardware, removing some of the burden from the host.
Table 12.9 Advantages and disadvantages of dedicated hardware designs

Advantages                                      Disadvantages
High performance for video coding               No support for other coding standards
Optimised for target video coding standard
Cost-effective for mass-market applications
[Figure residue omitted: host/accelerator partition for video decoding, showing the host CPU with host data buffers and the accelerator with video data buffers and frame buffers, output to display.]
3. The accelerator carries out IDCT and motion compensation and writes the reconstructed
frame to a display buffer.
4. The display buffer is displayed on the PC monitor whilst the accelerator processes further reconstructed frames.
Table 12.10 lists the advantages and disadvantages of this type of system. The flexibility
of software programmability together with dedicated hardware support for key functions
makes it an attractive option for PC-based video applications. Developers should benefit from the large PC market which will tend to ensure competitive pricing and performance for
the technology.
12.9 SUMMARY
Table 12.11 attempts to compare the merits of the processing platforms discussed in this
chapter. It should be emphasised that the rankings in this table arenot exact and there will be
exceptions in a number of cases (for example, a high-performance DSP that consumes more
power than a media processor). However, the general trend is probably correct: the best
coding performance per milliwatt of consumed power should be achievable with a dedicated
hardware design, but on the other hand PC and embedded platforms are likely to offer the
maximum flexibility and the best development support due to their widespread usage.
The recent trend is for a convergence between so-called dedicated media processors and general-purpose processors, for example demonstrated by the development of SIMD/VLIW-type functions for all the major PC processors. This trend is likely to continue as multimedia applications and services become increasingly important. At the same time, the latest generation of video coding standards (MPEG-4, H.263+ and H.26L) require relatively complex processing (e.g. to support object-based coding and coding mode decisions), as well as repetitive signal processing functions such as block-based motion estimation and transform coding.
Table 12.11 Comparison of platforms (approximate) [table not reproduced: the six platforms (dedicated hardware, video signal processor, media processor, PC processor, digital signal processor and embedded processor) are ranked from best to worst for video coding performance, power consumption and flexibility.]
These higher-level complexities are easier to handle in software than in dedicated hardware, and it may be that dedicated hardware CODECs will become less important (except for specialist, high-end functions such as studio encoding) and that general-purpose processors will take care of mass-market video coding applications (perhaps with media processors or co-processors to handle low-level signal processing).
In the next chapter we examine the main issues that are faced by the designer of a software or hardware video CODEC, including issues common to both (such as interface requirements) and the separate design goals for a software or hardware CODEC.
REFERENCES
1. M. Mittal, A. Peleg and U. Weiser, MMX technology architecture overview, Intel Technology Journal, 3rd Quarter, 1997.
2. J. Abel et al., Applications tuning for streaming SIMD extensions, Intel Technology Journal, 2nd Quarter, 1999.
3. H. Miyazawa, H.263 Encoder: TMS320C6000 Implementation, Texas Instruments Application Report SPRA721, December 2000.
4. K. Leung, N. Yung and P. Cheung, Parallelization methodology for video coding - an implementation on the TMS320C80, IEEE Trans. CSVT, 10(8), December 2000.
5. I. Thornton, MPEG-4 over Wireless Networks, ARM Inc. White Paper, 2000.
6. I. E. G. Richardson, K. Kipperman et al., Video coding using digital signal processors, Proc. DSP World Conference, Orlando, November 1999.
7. C. Basoglu et al., The MAP-CA VLIW-based Media Processor, Equator Technologies Inc. White Paper, January 2000. http://www.equator.com
8. G. Sullivan and C. Fogg, Microsoft DirectX VA: Video Acceleration API/DDI, Windows Platform Design Note, Microsoft, January 2000.
13.2 VIDEO CODEC INTERFACE
Figure 13.1 shows the main interfaces to a video encoder and video decoder:
Encoder input: frames of uncompressed video (from a frame grabber or other source);
control parameters.
Encoder output: compressed bit stream (adapted for the transmission network, see Chapter 11); status parameters.

Decoder input: compressed bit stream; control parameters.

Decoder output: frames of uncompressed video (sent to a display unit); status parameters.
A video CODEC is typically controlled by a host application or processor that deals with
higher-level application and protocol issues.
[Figure 13.1 residue omitted: (a) encoder: video in, control and status interface to the host, network adaptation (multiplexing, packetising, etc.), coded data out; (b) decoder: coded data in, control and status interface to the host, network adaptation, video out.]
(a) YUY2 (4:2:2). The structure of this format is shown in Figure 13.2. A sample of Y (luminance) data is followed by a sample of Cb (blue colour difference), a second sample of Y, a sample of Cr (red colour difference), and so on. The result is that the chrominance components have the same vertical resolution as the luminance component but half the horizontal resolution (i.e. 4:2:2 sampling as described in Chapter 2). In the example in the figure, the luminance resolution is 176 x 144 and the chrominance resolution is 88 x 144.

(b) YV12 (4:2:0) (Figure 13.3). The luminance samples for the current frame are stored in sequence, followed by the Cr samples and then the Cb samples. The Cr and Cb samples have half the horizontal and vertical resolution of the Y samples. Each colour
Figure 13.2 YUY2 (4:2:2)
pixel in the original image maps to an average of 12 bits (effectively one Y sample plus a quarter each of a Cr and a Cb sample), hence the name YV12 (see the sketch after this list). Figure 13.4 shows an example of a frame stored in this format, with the luminance array first followed by the half-width and half-height Cr and Cb arrays.
(c) Separate buffers for each component(Y, Cr, Cb). The CODEC ispassed a pointer to the
start of each buffer prior to encoding or decoding a frame.
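As an illustration of option (b), the sketch below locates the three planes within a single YV12-style buffer for the 176 x 144 example used above. The pointer arithmetic follows the layout described in the text; the variable names are illustrative only.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int width = 176, height = 144;            /* QCIF luminance size */
    const int y_size = width * height;              /* 176 x 144 Y samples */
    const int c_size = (width / 2) * (height / 2);  /* 88 x 72 chroma samples */
    const int total  = y_size + 2 * c_size;         /* 38016 bytes = 12 bits/pixel */

    static uint8_t frame[176 * 144 * 3 / 2];        /* one YV12 frame buffer */

    uint8_t *y  = frame;                   /* full-resolution luminance plane */
    uint8_t *cr = frame + y_size;          /* Cr plane follows Y (YV12 ordering) */
    uint8_t *cb = frame + y_size + c_size; /* Cb plane follows Cr */

    printf("Y %d bytes, Cr %d bytes, Cb %d bytes, total %d bytes\n",
           y_size, c_size, c_size, total);
    (void)y; (void)cr; (void)cb;
    return 0;
}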
As well as reading the source frames (encoder) and writing the decoded frames (decoder),
both encoder and decoder require to store one or more reconstructed reference frames for
motion-compensated prediction. These frame stores may be part of the CODEC (e.g. internally allocated arrays in a software CODEC) or separate from the CODEC (e.g. external RAM in a hardware CODEC).
[Figure residue omitted: a frame stored in YV12 format, the Y array for frame 1 followed by the half-resolution Cr and Cb arrays.]
Memory bandwidth may be a particular issue for large frame sizes and high frame rates. For example, in order to encode or decode video at television resolution (ITU-R 601, approximately 576 x 704 pixels per frame, 25 or 30 frames per second), the encoder or decoder video interface must be capable of transferring 216 Mbps. The data transfer rate may be higher if the encoder or decoder stores reconstructed frames in memory external to the CODEC. If forward prediction is used, the encoder must transfer data corresponding to three complete frames for each encoded frame, as shown in Figure 13.5: reading a new input frame, reading a stored frame for motion estimation and compensation and writing a reconstructed frame. This means that the memory bandwidth at the encoder input is at least 3 x 216 = 648 Mbps for ITU-R 601 video. If two or more prediction references are used for motion estimation/compensation (for example, during MPEG-2 B-picture encoding), the memory bandwidth is higher still.
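One way of arriving at the 216 Mbps figure is from the ITU-R 601 sampling rates (13.5 MHz luminance plus two chrominance components at 6.75 MHz, 8 bits per sample). The short sketch below reproduces this figure and the three-frame encoder transfer estimate; it is illustrative only.

#include <stdio.h>

int main(void)
{
    /* ITU-R 601: 13.5 MHz luminance sampling plus two chrominance
       components sampled at 6.75 MHz, 8 bits per sample. */
    double samples_per_sec = 13.5e6 + 2.0 * 6.75e6;   /* 27 Msamples/s */
    double bits_per_sec    = samples_per_sec * 8.0;   /* 216 Mbps      */

    /* Forward prediction: read input frame + read reference frame +
       write reconstructed frame = roughly 3x the raw video rate. */
    printf("video rate  = %.0f Mbps\n", bits_per_sec / 1e6);
    printf("encoder I/O = %.0f Mbps\n", 3.0 * bits_per_sec / 1e6);
    return 0;
}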
13.2.2
Coded video data is a continuous sequence of bits describing the syntax elements of coded video, such as headers, transform coefficients and motion vectors. If modified Huffman coding is used, the bit sequence consists of a series of variable-length codes (VLCs) packed together; if arithmetic coding is used, the bits describe a series of fractional numbers each
[Figure 13.5 residue omitted: encoder memory transfers via the video interface; the current frame and reference frame(s) are read from external memory for motion estimation and compensation, and the reconstructed frame is written back.]
representing a series of data elements (see Chapter 8). The sequence of bits must be mapped to a suitable data unit for transmission/transport, for example:
1. Bits: If the transmission channel is capable of dealing with an arbitrary number of bits, no special mapping is required. This may be the case for a dedicated serial channel but is unlikely to be appropriate for most network transmission scenarios.

2. Bytes or words: The bit sequence is mapped to an integral number of bytes (8 bits) or words (16 bits, 32 bits, 64 bits, etc.). This is appropriate for many storage or transmission scenarios where data is stored in multiples of a byte. The end of the sequence may require to be padded in order to make up an integral number of bytes (see the padding sketch after this list).

3. Complete coded unit: Partition the coded stream along boundaries that make up coded units within the video syntax. Examples of these coded units include slices (sections of a coded picture in MPEG-1, MPEG-2, MPEG-4 or H.263+), GOBs (groups of blocks, sections of a coded picture in H.261 or H.263) and complete coded pictures. The integrity of the coded unit is preserved during transmission, for example by placing each coded unit in a network packet.
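As an illustration of option 2, a bit-stream writer typically accumulates bits and flushes whole bytes, so padding to a byte boundary amounts to appending zero bits until the accumulator is empty. The sketch below is a simplified illustration; the structure and function names are hypothetical.

#include <stdint.h>
#include <stdio.h>

struct bitwriter {
    uint8_t  buf[256];   /* output bytes */
    int      nbytes;     /* bytes written so far */
    uint32_t acc;        /* bit accumulator */
    int      nbits;      /* number of valid bits in the accumulator */
};

/* Append 'len' bits (MSB first) to the stream. */
static void put_bits(struct bitwriter *w, uint32_t bits, int len)
{
    for (int i = len - 1; i >= 0; i--) {
        w->acc = (w->acc << 1) | ((bits >> i) & 1u);
        if (++w->nbits == 8) {
            w->buf[w->nbytes++] = (uint8_t)w->acc;
            w->acc = 0;
            w->nbits = 0;
        }
    }
}

/* Pad with zero bits up to the next byte boundary. */
static void byte_align(struct bitwriter *w)
{
    while (w->nbits != 0)
        put_bits(w, 0, 1);
}

int main(void)
{
    struct bitwriter w = { {0}, 0, 0, 0 };
    put_bits(&w, 0x2A, 6);   /* 6 bits of 'coded data' */
    byte_align(&w);          /* 2 zero bits of padding complete one byte */
    printf("%d byte(s), first = 0x%02X\n", w.nbytes, w.buf[0]);
    return 0;
}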
Figure 13.7 [figure and caption text not reproduced: pictures and their corresponding coded data]
Encoder
Frame rate  May be specified as a number of frames per second or as a proportion of frames to skip during encoding (e.g. skip every second frame). If the encoder is operating in a rate- or computation-constrained environment (see Chapter 10), then this will be a target frame rate (rather than an absolute rate) that may or may not be achievable.

Frame size  For example, a standard frame size (QCIF, CIF, ITU-R 601, etc.) or a non-standard size.

Quantiser step size  If rate control is not used, a fixed quantiser step size may be specified: this will give near-constant video quality.

Mode control  Optional mode selection: MPEG-2, MPEG-4 and H.263 include a number of optional coding modes (for improved coding efficiency, improved error resilience, etc.). Most CODECs will only support a subset of these modes, and the choice of optional modes to use (if any) must be signalled or negotiated between the encoder and the decoder.

Start/stop encoding  Start or stop encoding of a series of video frames.
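One way of presenting these controls to a host application is as a parameter structure passed to the encoder when it is started. The sketch below is purely illustrative: the structure, field names and entry points are hypothetical and are not taken from any particular CODEC API.

#include <stdint.h>

/* Hypothetical encoder control parameters, grouping the items above. */
struct encoder_params {
    double   target_frame_rate;   /* frames per second (may be a target only) */
    int      frame_skip;          /* or: encode 1 frame in every (frame_skip + 1) */
    int      frame_width;         /* e.g. 176 for QCIF, 352 for CIF */
    int      frame_height;        /* e.g. 144 for QCIF, 288 for CIF */
    int      quantiser_step;      /* fixed step size if rate control is off */
    int      rate_control_on;     /* 0 = fixed quantiser, 1 = rate-controlled */
    uint32_t optional_modes;      /* bit flags for negotiated optional modes */
};

/* Hypothetical entry points: the host starts and stops encoding of a
   series of video frames using these parameters. */
int  encoder_start(const struct encoder_params *params);
void encoder_stop(void);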
Decoder
Most of the parameters listed above are signalled to the decoder within the coded bit stream itself. For example, quantiser step size is signalled in frame/picture headers and (optionally) macroblock headers; frame rate is signalled by means of a timing reference in each picture header; mode selection is signalled in the picture header; and so on. Decoder control may be limited to start/stop.
The following may be useful as status parameters:

• actual frame rate (may differ from the target frame rate in rate- or computation-constrained environments);

• quantiser step size for each macroblock (this may be useful for post-decoder filtering, see Chapter 9);
• distribution of coded bits (e.g. proportion of bits allocated to coefficients, motion vector data, header data);

• error indication (returned by the decoder when a transmission error has been detected, possibly with the estimated location of the error in the decoded frame).
1. Maximise encoded frame rate. A suitable target frame rate depends on the application, for
example, 12-15 frames per second for desktop video conferencingand 25-30 frames per
second for television-quality applications.
4. Maximise video quality (for a given bit rate). Within the constraints of a video coding standard, there are usually many opportunities to trade off video quality against computational complexity, such as the variable complexity algorithms described in Chapter 10.
5. Minimise delay (latency) through the CODEC. This is particularly important for
two-way
applications (such as video conferencing) where low delay is essential.
6. Minimise compiled code and/or data size. This is important for platforms with limited
available memory (such as embedded platforms). Some features
of the popular video
coding standards (such as the use of B-pictures) provide high compression efficiency at
the expense of increased storage requirement.
7. Provide a flexible API, perhaps within a standard framework (see Chapter 12).
Figure 13.8 Trade-off of frame size and frame rate in a software CODEC
13.3.2
Based on the requirements of the syntax (for example, MPEG-2, MPEG-4 or H.263), an initial partition of the functions required to encode and decode a frame of video can be made. Figure 13.9 shows a simplified flow diagram for a block/macroblock-based inter-frame encoder (e.g. MPEG-1, MPEG-2, H.263 or MPEG-4) and Figure 13.10 shows the equivalent decoder flow diagram.
The order of some of the operations is fixed by the syntax of the coding standards. It is necessary to carry out DCT and quantisation of each block within a macroblock before generating the VLCs for the macroblock header: this is because the header typically contains a coded block pattern field that indicates which of the six blocks actually contain coded transform coefficients. There is greater flexibility in deciding the order of some of the other
[Figure 13.9 residue omitted: encoder flow diagram. Pre-process the frame and code the picture header; then for each macroblock: motion estimate and compensate, DCT and quantise each of the 6 blocks, code the macroblock header and motion data, reorder and run-length encode, variable-length encode each block, rescale and IDCT each block and reconstruct the macroblock.]

[Figure 13.10 residue omitted: decoder flow diagram. Decode the picture header; then for each macroblock: decode the macroblock header, variable-length decode, run-length decode and reorder, rescale and IDCT each of the 6 blocks and reconstruct the macroblock; finally post-process the frame.]
[Figure residue omitted: block coding data flow. Forward path: frames, DCT, quantise, zigzag, run-length encode, variable-length encode. Reverse path: variable-length decode, run-length decode and reorder, rescale, IDCT, reconstruct.]
operations. An encoder may choose to carry out motion estimation and compensation for the
entire frame before carrying out the block-level operations (DCT, quantise, etc.), instead of
coding the blocks immediately after motion compensating the macroblock. Similarly, an
encoder or decoder may choose to reconstruct each motion-compensated macroblock either
immediately after decoding the residual blocks or after the entire residual frame has been
decoded.
The following principles can help to decide the structure of the software program:
1. ... modular.
2. Minimise data copying between functions (since each copy adds computation).
3. Minimise function-calling overheads. This may involve combining functions, leading to
less modular code.
4. Minimise latency. Coding and transmitting each macroblock immediately after motion
estimation and compensation can reduce latency. The coded data may be transmitted immediately, rather than waiting until the entire frame has been motion-compensated
before coding and transmitting the residual data.
5. Consider combining functions to reduce function calling overheads and data copies. For
example, a decoder carries out inverse zigzag ordering of a block followed by inverse
quantisation. Each operation involves a movement of data from one array into another,
together with the overhead of calling and returning from a function. By combining the
two functions, data movement and function calling overhead is reduced (see the sketch after this list).
6. For computationally critical operations (such as motion estimation), consider using platform-specific optimisations such as inline assembler code, compiler directives or platform-specific library functions (such as Intel's image processing library).
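As an illustration of point 5 above, the two inverse operations can be merged into a single pass over the block so that each coefficient is moved only once and only one function call is made per block. The sketch below uses a simple uniform rescale purely for illustration; the function and parameter names are hypothetical.

#include <stdint.h>

/* Combined inverse (zigzag) scan and inverse quantisation: coefficients in
   transmission order are rescaled and written directly to their raster
   positions in the 8x8 block, avoiding an intermediate array and a second
   function call. */
static void inverse_scan_and_rescale(const int16_t *coeff_in,   /* 64 coefficients, scan order */
                                     const uint8_t *scan_table, /* scan_table[i] = raster index */
                                     int            quant_step,
                                     int16_t       *block_out)  /* 64-entry block, raster order */
{
    for (int i = 0; i < 64; i++)
        block_out[scan_table[i]] = (int16_t)(coeff_in[i] * quant_step);
}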
Applying some or all of these techniques can dramatically improve performance. However,
these approaches can lead to increased design time, increased compiled code size (for example, due to unrolled loops) and complex software code that is difficult to maintain or
modify.
Example
An H.263 CODEC was developed for the TriMedia TM1000 platform. After the first pass of the software design process (i.e. without detailed optimisation), the CODEC ran at the unacceptably low rate of 2 CIF frames per second. After reorganising the software (combining functions and removing interdependencies between data), execution speed was increased to 6 CIF frames per second. Applying platform-specific optimisation of
critical functions (using the TriMedia VLIW instructions) gave a further increase to 15
CIF frames per second (an acceptable rate for video-conferencing applications).
13.3.5 Testing
In addition to the normal requirements for software testing, the following areas should be checked for a video CODEC design:

• Interworking between encoder and decoder (if both are being developed).

• Performance with a range of video material (including live video if possible), since some bugs may only show up under certain conditions (for example, an incorrectly decoded VLC may only occur occasionally).

• Interworking with third-party encoder(s) and decoder(s). Recent video coding standards have software test models available that are developed alongside the standard and provide a useful reference for interoperability tests.

• Decoder performance under error conditions, such as random bit errors and packet losses.
To aid in debugging, it can be useful to provide a trace mode in which each of the main
coding functions records its data to a log file. Without this type of mode, it can be very
difficult to identify the cause of a software error (say) by examining the stream of coded bits. A real-time test framework which enables live video from a camera to be coded and decoded in real time using the CODEC under development can be very useful for testing
purposes, as can be bit-stream analysis tools (such as MPEGTool) that provide statistics
about a coded video sequence.
Some examples of efficient software video CODEC implementations have been discussed. Opportunities have been examined for parallelising video coding algorithms for multiple-processor platforms, and a method has been described for splitting a CODEC implementation between dedicated hardware and software.8 In the next section we will discuss approaches to designing dedicated VLSI video CODECs.
7. Minimise off-chip data transfers (memory bandwidth) as these can often act as a performance bottleneck for a hardware design.

8. Provide a flexible interface to the host system (very often a processor running higher-level application software).
In a hardware design, trade-offs occur between the first four goals (maximise frame rate/
frame size/peak bit rate/quality) and numbers (6) and (7) above (minimise gate count/power consumption and memory bandwidth). As discussed in Chapters 6-8, there are many
alternative architectures for the key coding functions such as motion estimation, DCT and
variable-length coding, but higher performance often requires an increased gate count. An
important constraint is the cycle budget for each coded macroblock. This can be calculated
based on the target frame rate and frame size and the clock speed of the chosen platform.
Example
Target frame size: QCIF (99 macroblocks per frame, H.263/MPEG-4 coding)
Target frame rate: 30 frames per second
Clock speed: 20 MHz

Macroblocks per second: 99 x 30 = 2970
Clock cycles available per macroblock: 20 x 10^6 / 2970 ≈ 6734

This means that all macroblock operations must be completed within 6734 clock cycles. If the various operations (motion estimation, compensation, DCT, etc.) are carried out serially then the sum total for all operations must not exceed this figure; if the operations are pipelined (see below) then any one operation must not take more than 6734 cycles.
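The same cycle-budget calculation can be expressed directly in code, which is convenient when experimenting with different frame sizes, frame rates and clock speeds. The sketch below is illustrative only; the function name is hypothetical.

#include <stdio.h>

/* Clock cycles available per macroblock for a given frame size,
   frame rate and clock speed. */
static double cycles_per_macroblock(int mb_per_frame, double fps, double clock_hz)
{
    return clock_hz / (mb_per_frame * fps);
}

int main(void)
{
    /* QCIF: 99 macroblocks/frame, 30 frames/s, 20 MHz clock -> about 6734. */
    printf("budget = %.0f cycles per macroblock\n",
           cycles_per_macroblock(99, 30.0, 20e6));
    return 0;
}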
[Figure residue omitted: hardware CODEC with functional units (motion estimator, motion compensator, FDCT/IDCT, etc.) connected to a controller and interface via a common bus.]
[Figure 13.13 residue omitted: pipelined hardware CODEC architecture, with controller and RAM feeding a processing pipeline of motion estimator, motion compensator, FDCT, quantise, reorder, RLE and further units.]
architecture. This type of architecture may be flexible and adaptable but the performance
may be constrained by data transfer over the bus and scheduling of the individual processing
units. A fully pipelined architecture such
as the example in Figure 13.13 has the potential to give
high performance due to pipelined execution by the separate functional units. However, this
type of architecture may require significant redesign in order to support a different coding
standard or a new optional coding mode.
A further consideration for a hardware design is the partitioning between the dedicated
hardware and the host processor. A co-processor architecture such as that described in the
DirectX VA framework (see Chapter 12) implies close interworking between the host and
the hardware on a macroblock-by-macroblock basis. An alternative approach is to move
more operations into hardware, for example by allowing the hardware to process a complete
frame of video independently of the host.
13.4.4 Testing
Testing and verification of a hardware CODEC can be a complicated process, particularly since it may be difficult to test with real video inputs until a hardware prototype is available. It may be useful to develop a software model that matches the hardware design to
assist in generating test vectors and checking the results. A real-time test bench, where a
hardware design is implemented on a reprogrammable FPGA in conjunction with a host system
and video capture/display capabilities, can support testing with a range of real video sequences.
VLSI video CODEC design approaches and examples have been reviewed9,10 and two specific design case studies presented.11,12
13.5 SUMMARY
The design of a video CODEC depends on the target platform, the transmission environment and the user requirements. However, there are some common goals and good design practices that may be useful for a range of designs. Interfacing to a video CODEC is an important issue, because of the need to efficiently handle a high bandwidth of video data in real time and because flexible control of the CODEC can make a significant difference to performance. There are many options for partitioning the design into functional blocks and the choice of partition will affect the performance and modularity of the system. A large number of alternative algorithms and designs exist for each of the main functions in a video CODEC. A good design approach is to use simple algorithms where possible and to replace these with more complex, optimised algorithms in performance-critical areas of the design. Comprehensive testing with a range of video material and operating parameters is essential to ensure that all modes of CODEC operation are working correctly.
REFERENCES
1. I. Richardson, K. Kipperman and G. Smith, Video coding using digital signal processors, Proc. DSP World Conference, Orlando, November 1999.
11. P. Pirsch and H.-J. Stolberg, VLSI implementations of image and video multimedia processing systems, IEEE Transactions on Circuits and Systems for Video Technology, 8(7), November 1998, pp. 878-891.
12. A. Y. Wu, K. J. R. Liu, A. Raghupathy and S. C. Liu, System Architecture of a Massively Parallel
Programmable Video Co-Processor, Technical Report ISR TR 95-34, University of Maryland,
1995.
Future Developments
14.1 INTRODUCTION
This book has concentrated on the design of video CODECs that are compatible with current
standards (in particular, MPEG-2, MPEG-4 and H.263) and on the current state of the art in
video coding technology. Video coding is a fast-moving subject and current research in the
field moves beyond the bounds of the international standards; at the same time, improvements in processing technology will soon make it possible to implement techniques
that
were previously considered too complex. This final chapter reviews trends in video coding
standards, research and platforms.
14.2 STANDARDS EVOLUTION
The ISO MPEG organisation is at present concentrating on two main areas: updates to existing standards and a new standard, MPEG-21. MPEG-4 is a large and complex standard with many functions and tools that go well beyond the basic H.263-like functionality of the popular simple profile CODEC. It was originally designed with continual evolution in mind: as new techniques and applications become mature, extra tools and profiles continue to be added to the MPEG-4 set of standards. Recent work, for example, has included new profiles that support some of the emerging Internet-based applications for MPEG-4. Some of the more advanced elements of MPEG-4 (such as sprite coding and model-based coding) are not yet widely used in practice, partly for reasons of complexity. As these elements become more popular (perhaps due to increased processor capabilities), it may be that their description in the standard will need to be modified and updated.
MPEG-21 builds on the coding tools of MPEG-4 and the content description tools of the MPEG-7 standard to provide a framework for multimedia communication. The MPEG committee has moved beyond the details of coding and description to an ambitious effort to standardise aspects of the complete multimedia delivery chain, from creation to consumption (viewing or interacting with the data). This process may include the standardisation of new coding and compression tools.
The Video Coding Experts Group of the ITU continues to develop the H.26x series of standards. The recently added Annexes V, W and X of H.263 are expected to be the last major revisions to this standard. The main ongoing effort is to finalise the first version of H.26L: the core tools of the standard (described in Chapter 5) are reasonably well defined, but there is further work required to convert these into a published international standard. The technical aspects of H.26L were scheduled to be finalised during 2001. However, there
• 22 papers on the implementation and optimisation of the popular block DCT-based video coding standards;

• 11 papers on transmission issues;

• 22 papers on content-based and object-based coding (including MPEG-4 object-based coding);
there will continue to be distinct classes of platform for video coding, possibly along the following lines:

1. PC processors with media processing functions and increasing use of hardware co-processing (e.g. in video display cards).

2. More streamlined processors (e.g. embedded processors with internal or external multimedia support, or media processors) for embedded multimedia applications.
3. Dedicated hardware CODECs (with limited programmability) for efficient implementation of mass-market applications such as digital TV decoding.
There is still a place in the market for dedicated hardware designs but at the same time there is a trend towards flexible, embedded designs for new applications such as mobile multimedia. The increasing use of system on a chip (SoC) techniques, with which a complex IC design can be rapidly put together from Intellectual Property building blocks, should make it possible to quickly reconfigure and redesign a dedicated hardware CODEC. This will be necessary if dedicated designs are to continue to compete with the flexibility of embedded or general-purpose processors.
14.5 APPLICATION TRENDS
Predicting future directions for multimedia applications is notoriously difficult. Few of the
interactive applications that were proposed in the early 1990s, for example, have gained a
significant market presence. The largest markets for video coding at present are probably
digital television broadcasting and DVD-video (both utilising MPEG-2 coding). Internet video is gaining popularity, but is hampered by the limited Internet connections experienced by most users. There are some signs that MPEG-4 coding for video compression, storage and playback may experience a boom in popularity similar to MPEG Layer 3 Audio (MP3 audio). However, much work needs to be done on the management and protection of intellectual property rights before this can take place.
Videoconferencing via the Internet (typically using the H.323 protocol family) is becoming more widely used and may gain further acceptance with increases in processor and connection performance. It has yet to approach the popularity of communication via voice, e-mail and text messaging. There are two application areas that are currently of interest to developers and communications providers, at opposite ends of the bandwidth spectrum:
(b) High definition television (HDTV, approximately twice the resolution of ITU-R 601 standard digital television). Coding methods (part of MPEG-2) have been standardised for several years but this technology has not yet taken hold in the marketplace.
(c) Digital cinema offers an alternative to the reels of projector film that are still used for distribution and display of cinema films. There is currently an effort by the MPEG committee (among others) to develop standard(s) to support cinema-quality coding of video and audio. MPEG's requirements document for digital cinema4 specifies visually lossless compression (i.e. no loss should be discernible by a human observer in a movie theatre) of frames containing up to 16 million pixels at frame rates of up to 150 Hz. In comparison, an ITU-R 601 frame contains around 0.5 million pixels. Coding and decoding at cinema fidelity are likely to be extremely demanding and will pose some difficult challenges for CODEC developers.
An interesting by-product of the mainstream video coding applications and standards is
the growing list ofnew and innovative applications for digital video. Some examples include
the use of live video in computer games; video chat on a large scale with multiple participants; video surveillance in increasingly hostile environments (such
as in an oil well or
inside the body of a patient); 3-D video conferencing; video conferencing for groups with
special requirements (for example deaf users); and many others.
Early experiences have taught designers of digital video applications that an application
will only be successful if users find it to be a usable, useful improvement over existing
technology. In many cases the design of the user interface is as important as, or more
important than, the efficiency of a video coding algorithm. Usability is
a vital but often
overlooked requirement for any new video-based application.
interesting trends (for example, the continued popularity of MJPEG video CODECs because of their design simplicity and inherent error resilience) imply that the video communications market is likely to continue to be driven more by user needs than by impressive research developments. This in turn implies that only some of the recent developments in video coding (such as object-based coding, content-based tools, media processors and so on) will survive. However, video coding will remain a core element of the growing multimedia communications market. Platforms, algorithms and techniques for video coding will continue to change and evolve. It is hoped that this book will help to make the subject of video CODEC design accessible to a wider audience of designers, developers, integrators and users.
REFERENCES
1. T. Ebrahimi and M. Kunt, Visual data compression for multimedia applications: an overview, Proceedings of the IEEE, 86(6), June 1998.
2. ISO/IEC JTC1/SC29/WG11 N4318, MPEG-21 overview, Sydney, July 2001.
3. ITU-T Q6/SG16 VCEG-L45, H.26L Test Model Long-term number 6 (TML-6) draft 0, March 2001.
4. ISO/IEC JTC1/SC29/WG11 N4331, Digital cinema requirements, Sydney, July 2001.
Bibliography
1. Bhaskaran, V. and K. Konstantinides, Image and Video Compression Standards: Algorithms and
Architectures, Kluwer, 1997.
2. Ghanbari, M. Video Coding: An Introduction to Standard Codecs, IEE Press, 1999.
3. Girod, B., G. Greiner and H. Niemann (eds), Principles of 3D Image Analysis and Synthesis,
Kluwer, 2000.
4. Haskell, B., A. Puri and A. Netravali, Digital Video: An Introduction to MPEG-2,Chapman & Hall,
1996.
5. Netravali, A. and B. Haskell, Digital Pictures: Representation, Compression and Standards, Plenum
Press, 1995.
6. Parhi, K. K. and T. Nishitani (eds), Digital Signal Processing for Multimedia Systems, Marcel
Dekker, 1999.
7. Pennebaker, W. B. and J. L. Mitchell, JPEG: Still Image Data Compression Standard, Van Nostrand
Reinhold, 1993.
8. Pennebaker, W. B., J. L. Mitchell, C. Fogg and D. LeGall, MPEG Digital Video Compression
Standard, Chapman & Hall, 1997.
9. Puri, A. and T. Chen (eds), Multimedia Systems, Standards and Networks, Marcel Dekker, 2000.
10. Rao, K. R. and J. J. Hwang, Techniques and Standards for Image, Video and Audio Coding,
Prentice Hall, 1997.
11. Rao, K. R. and P. Yip, Discrete Cosine Transform, Academic Press, 1990.
12. Riley, M. J. and I. G. Richardson, Digital Video Communications, Artech House, February 1997.
Glossary
4:2:0 (sampling)
4:2:2 (sampling)
4:4:4 (sampling)
API
arithmetic coding
artefact
BAB
baseline (CODEC)
block matching
blocking
B-picture
channel coding
chrominance
CIF
CODEC
colour space
DCT
DFD
DPCM
DSCQS
DVD
DWT
entropy coding
error concealment
field
flowgraph
full search
GOB
GOP
H.261
H.263
H.26L
HDTV
Huffman coding
HVS    human visual system, the system by which humans perceive and interpret visual images
inter-frame (coding)    coding of video frames using temporal prediction or compensation
interlaced (video)    video data represented as a series of fields
intra-frame (coding)    coding of video frames without temporal prediction
ISO    International Standards Organisation
ITU    International Telecommunication Union
ITU-R 601    a colour video image format
JPEG    Joint Photographic Experts Group, a committee of ISO; also an image coding standard
JPEG-2000    an image coding standard
KLT    Karhunen-Loeve transform
latency    delay through a communication system
loop filter    spatial filter placed within encoding or decoding feedback loop
MCU    multi-point control unit, controls a multi-party conference
media processor    processor with features specific to multimedia coding and processing
memory bandwidth    data transfer rate to/from RAM
MJPEG    system of coding a video sequence using JPEG intra-frame compression
motion compensation    prediction of a video frame with modelling of motion
motion estimation    estimation of relative motion between two or more video frames
motion vector    vector indicating a displaced block or region to be used for motion compensation
MPEG    Motion Picture Experts Group, a committee of ISO
MPEG-1    a video coding standard
MPEG-2    a video coding standard
MPEG-4    a video coding standard
objective quality    visual quality measured by algorithm(s)
OBMC    overlapped block motion compensation
profile    a set of functional capabilities (of a video CODEC)
progressive (video)    video data represented as a series of complete frames
pruning (transform)    reducing the number of calculated transform coefficients
PSNR    peak signal to noise ratio, an objective quality measure
QCIF    quarter common intermediate format
QoS    quality of service
quantise    reduce the precision of a scalar or vector quantity
rate control    control of bit rate of encoded video signal
rate-distortion    measure of CODEC performance (distortion at a range of coded bit rates)
RGB    red/green/blue colour space
ringing (artefacts)    ripple-like artefacts around sharp edges in a decoded image
RTP    real-time protocol, a transport protocol for real-time data
RVLC    reversible variable length code
scalable coding    coding a signal into a number of layers
short header (MPEG-4)    a coding mode that is functionally identical to H.263 (baseline)
SIMD    single instruction multiple data
slice    a region of a coded picture
statistical redundancy    redundancy due to the statistical distribution of data
subjective quality    visual quality as perceived by human observer(s)
subjective redundancy    redundancy due to components of the data that are subjectively insignificant
sub-pixel (motion compensation)    motion-compensated prediction from a reference area that may be formed by interpolating between integer-valued pixel positions
test model
TSS
VCA
VCEG
video packet (MPEG-4)
video processor
VLC
VLD
VLE
VLIW
VLSI
VO (MPEG-4)
VOP (MPEG-4)
VQEG
YCrCb