Data Compression Introduction
Prof. Ja-Ling Wu
– Definition:
The expected length L(C) of a source code C(x) for a r.v. X
with probability mass function p(x) is given by
$$L(C) = \sum_{x \in \mathcal{X}} p(x)\, l(x)$$
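As a quick numeric illustration (a minimal sketch; the probabilities and lengths below are made up for this example, not taken from these notes):

```python
# Expected length L(C) = sum_x p(x) * l(x) for an illustrative 4-symbol source.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # assumed probabilities
l = {"a": 1, "b": 2, "c": 3, "d": 3}                # assumed codeword lengths

L = sum(p[x] * l[x] for x in p)
print(L)  # 1.75
```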
– Definition:
A code is said to be non-singular if every element of the
range of X maps into a different string in D*, i.e.,
$$x_i \ne x_j \;\Rightarrow\; C(x_i) \ne C(x_j)$$
– Definition:
An extension C* of a code C is a mapping from finite-length
strings of X to finite-length strings of D*, defined by
$$C(x_1 x_2 \cdots x_n) = C(x_1)\, C(x_2) \cdots C(x_n)$$
– Ex:
If c(x1) = 00 and c(x2) = 11
then c(x1x2) = 0011
– Definition:
A code is called uniquely decodable if its extension is non-singular.
In other words, any encoded string in a uniquely decodable
code has only one possible source string producing it.
– Definition:
A code is called a prefix code or an instantaneous code if no
codeword is a prefix of any other codeword.
[Figure: nested classes of codes — all codes ⊃ non-singular codes ⊃ uniquely decodable codes ⊃ instantaneous codes]
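The prefix condition is easy to check mechanically. A minimal sketch (the helper is_prefix_code and the sample codebooks are ours, for illustration only):

```python
def is_prefix_code(codewords):
    """Return True if no codeword is a prefix of another (instantaneous code)."""
    # After sorting, any prefix appears immediately before some word it prefixes.
    words = sorted(codewords)
    return all(not words[i + 1].startswith(words[i]) for i in range(len(words) - 1))

print(is_prefix_code(["0", "10", "110", "111"]))  # True  (prefix-free)
print(is_prefix_code(["0", "01"]))                # False ("0" prefixes "01")
```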
KRAFT INEQUALITY:
Goal of source coding:
constructing instantaneous codes of minimum
expected length to describe a given source.
Theorem (Kraft inequality)
For any instantaneous code (prefix code) over an
alphabet of size D, the codeword lengths l1, l2,…, lm
must satisfy the inequality
$$\sum_{i=1}^{m} D^{-l_i} \le 1$$
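The inequality can be tested directly. A small sketch (kraft_sum is our name, not from the notes):

```python
def kraft_sum(lengths, D=2):
    """Sum of D^(-l_i); an instantaneous code with these lengths exists iff <= 1."""
    return sum(D ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> realizable as a binary prefix code
print(kraft_sum([1, 1, 2]))     # 1.25 -> no binary prefix code has these lengths
```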
Proof:
Consider a D-ary tree in which each node has D
children. Let the branches of the tree represent the
symbols of the codeword. For example, the D
branches from the root represent the D possible
values of the first symbol of the codeword. Then
each codeword is represented by a leaf on the
tree. The path from the root traces out the symbols
of the codeword.
The prefix condition on the codewords implies that no
codeword is an ancestor of any other codeword on the
tree. Hence each codeword eliminates its descendants
as possible codewords.
Let $l_{\max}$ be the length of the longest codeword and consider all
nodes of the tree at level $l_{\max}$. A codeword at level $l_i$ has
$D^{l_{\max} - l_i}$ descendants at level $l_{\max}$, and by the prefix
condition these descendant sets are disjoint; moreover, the total
number of nodes at level $l_{\max}$ is $D^{l_{\max}}$. Summing over all
codewords,
$$\sum_i D^{l_{\max} - l_i} \le D^{l_{\max}}, \quad\text{or}\quad \sum_i D^{-l_i} \le 1.$$
[Figure: a binary code tree with D = 2 and $l_{\max} = 4$. The total number
of nodes at level $l_{\max}$ is $2^4$; a node at level 2 has $2^{4-2} = 4$
descendants at level $l_{\max}$.]
Proof:
Let the D-ary alphabet be $\{0, 1, \ldots, D-1\}$. Consider the
$i$-th codeword $y_1 y_2 \cdots y_{l_i}$. Let $0.y_1 y_2 \cdots y_{l_i}$ be the real
number given by the D-ary expansion
$$0.y_1 y_2 \cdots y_{l_i} = \sum_{j=1}^{l_i} y_j D^{-j}.$$
This codeword corresponds to the interval $\bigl[\,0.y_1 y_2 \cdots y_{l_i},\; 0.y_1 y_2 \cdots y_{l_i} + D^{-l_i}\bigr)$,
the set of all real numbers whose D-ary expansion begins with
$0.y_1 y_2 \cdots y_{l_i}$; this interval has length $D^{-l_i}$.
By the prefix condition, these intervals are disjoint.
Hence the sum of their lengths has to be less than or
equal to 1. This proves that
$$\sum_i D^{-l_i} \le 1.$$
Optimal codes:
Minimize $L = \sum_i p_i l_i$
over the integers $l_1, l_2, \ldots, l_m$ satisfying
$$\sum_{i=1}^{m} D^{-l_i} \le 1.$$
We neglect the integer constraint on the $l_i$ and assume
equality in the constraint. With a Lagrange multiplier, minimize
$$J = \sum_i p_i l_i + \lambda \sum_i D^{-l_i}.$$
Differentiating with respect to $l_i$,
$$\frac{\partial J}{\partial l_i} = p_i - \lambda D^{-l_i} \log_e D = 0,$$
so that
$$D^{-l_i} = \frac{p_i}{\lambda \log_e D}.$$
Substituting this in the constraint to find $\lambda$,
we find $\lambda = 1 / \log_e D$, and hence
$$p_i = D^{-l_i},$$
yielding optimal code lengths
$$l_i^* = -\log_D p_i.$$
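For a dyadic distribution the unconstrained optimum happens to be integer-valued already; a small sketch with illustrative probabilities (D = 2):

```python
import math

# l_i* = -log2(p_i) for an assumed dyadic distribution; L* then equals H_2(X).
p = [0.5, 0.25, 0.125, 0.125]
l_star = [-math.log2(pi) for pi in p]
print(l_star)                                     # [1.0, 2.0, 3.0, 3.0]
print(sum(pi * li for pi, li in zip(p, l_star)))  # 1.75 = H_2(X)
```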
This non-integer choice of codeword lengths yields
expected codeword length
$$L^* = \sum_i p_i l_i^* = -\sum_i p_i \log_D p_i = H_D(X),$$
but since the $l_i$ must be integers, we will not always
be able to set the codeword lengths to exactly $-\log_D p_i$.
Instead, we should choose a set of codeword lengths
$l_i$ "close" to the optimal set.
Theorem:
The expected length L of any instantaneous D-ary
code for a r.v. X is greater than or equal to the entropy
$H_D(X)$, i.e.,
$$L \ge H_D(X),$$
with equality iff $D^{-l_i} = p_i$.
Proof:
The difference between the expected length and the entropy
can be written as
$$L - H_D(X) = \sum_i p_i l_i - \sum_i p_i \log_D \frac{1}{p_i} = -\sum_i p_i \log_D D^{-l_i} + \sum_i p_i \log_D p_i.$$
Letting $r_i = D^{-l_i} \big/ \sum_j D^{-l_j}$ and $c = \sum_i D^{-l_i}$, we obtain
$$L - H = \sum_i p_i \log_D \frac{p_i}{r_i} - \log_D c = D(p \,\|\, r) + \log_D \frac{1}{c} \ge 0,$$
since the relative entropy $D(p \,\|\, r)$ is nonnegative and, by the
Kraft inequality, $c \le 1$. Hence $L \ge H_D(X)$, with equality iff
$p_i = D^{-l_i}$, i.e., iff $-\log_D p_i$ is an integer for all i.
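The decomposition used in this proof can be checked numerically; a sketch with illustrative p and lengths (D = 2):

```python
import math

p = [0.4, 0.3, 0.2, 0.1]          # assumed source distribution
l = [1, 2, 3, 3]                  # lengths of some binary prefix code
c = sum(2.0 ** -li for li in l)   # Kraft sum
r = [2.0 ** -li / c for li in l]  # normalized "code-implied" distribution

L = sum(pi * li for pi, li in zip(p, l))
H = -sum(pi * math.log2(pi) for pi in p)
Dpr = sum(pi * math.log2(pi / ri) for pi, ri in zip(p, r))

# L - H = D(p||r) + log2(1/c), and both terms are nonnegative.
print(math.isclose(L - H, Dpr + math.log2(1.0 / c)))  # True
```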
Bounds on the Optimal Codelength
Since $\log_D \frac{1}{p_i}$ may not equal an integer, we round it up
to give integer word-length assignments,
$$l_i = \left\lceil \log_D \frac{1}{p_i} \right\rceil,$$
where $\lceil x \rceil$ is the smallest integer $\ge x$.
These lengths satisfy the Kraft inequality since
$$\sum_i D^{-\left\lceil \log_D \frac{1}{p_i} \right\rceil} \le \sum_i D^{-\log_D \frac{1}{p_i}} = \sum_i p_i = 1,$$
and the resulting expected length (and hence also that of the
optimal code) satisfies
$$H_D(X) \le L^* < H_D(X) + 1.$$
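A sketch of these rounded ("Shannon") lengths and the resulting bound, with an illustrative distribution:

```python
import math

def shannon_lengths(p, D=2):
    """l_i = ceil(log_D(1/p_i)); these lengths satisfy the Kraft inequality."""
    return [math.ceil(math.log(1.0 / pi, D)) for pi in p]

p = [0.4, 0.3, 0.2, 0.1]           # assumed distribution
lengths = shannon_lengths(p)       # [2, 2, 3, 4]
H = -sum(pi * math.log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))
print(H <= L < H + 1)              # True: H(X) <= L < H(X) + 1
```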
From the preceding theorem, there is an overhead
which is at most 1 bit, due to the fact that $\log \frac{1}{p_i}$ is
not always an integer. One can reduce the overhead
per symbol by spreading it out over many symbols:
For a block of n symbols, the expected codeword length per
input symbol is
$$L_n = \frac{1}{n} \sum p(x_1, x_2, \ldots, x_n)\, l(x_1, x_2, \ldots, x_n) = \frac{1}{n}\, E\bigl[\, l(X_1, X_2, \ldots, X_n) \bigr].$$
Applying the bounds above to the block,
$$H(X_1, X_2, \ldots, X_n) \le E\bigl[\, l(X_1, X_2, \ldots, X_n) \bigr] < H(X_1, X_2, \ldots, X_n) + 1.$$
Since $X_1, X_2, \ldots, X_n$ are i.i.d.,
$$H(X_1, X_2, \ldots, X_n) = \sum_i H(X_i) = n H(X),$$
and therefore
$$H(X) \le L_n < H(X) + \frac{1}{n}:$$
by using large block lengths, we can achieve an expected
codelength per symbol arbitrarily close to the entropy.
If $X_1, X_2, \ldots, X_n$ are not i.i.d.,
$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le L_n < \frac{H(X_1, X_2, \ldots, X_n)}{n} + \frac{1}{n}.$$
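A sketch of this blocking effect for an illustrative i.i.d. binary source, using ceil(log2(1/q)) block-codeword lengths; the per-symbol length L_n stays inside [H, H + 1/n):

```python
import math
from itertools import product

p = {"a": 0.7, "b": 0.3}                          # assumed i.i.d. source
H = -sum(q * math.log2(q) for q in p.values())

for n in (1, 2, 4, 8):
    Ln = 0.0
    for block in product(p, repeat=n):
        q = math.prod(p[s] for s in block)        # p(x_1, ..., x_n)
        Ln += q * math.ceil(math.log2(1.0 / q))   # contribution q * l(block)
    Ln /= n                                       # per-symbol expected length
    print(n, round(Ln, 4), H <= Ln < H + 1.0 / n) # bound holds for every n
```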
Theorem: The minimum expected codeword length
per symbol satisfies
$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le L_n^* < \frac{H(X_1, X_2, \ldots, X_n)}{n} + \frac{1}{n}.$$
Moreover, if $X_1, X_2, \ldots, X_n$ is a stationary stochastic
process, then
$$L_n^* \to H(\mathcal{X}),$$
where $H(\mathcal{X})$ is the entropy rate of the process.
Theorem: The expected length under p(x) of the code
assignment $l(x) = \left\lceil \log \frac{1}{q(x)} \right\rceil$ satisfies
$$H(p) + D(p \,\|\, q) \le E_p\bigl[\, l(X) \bigr] < H(p) + D(p \,\|\, q) + 1.$$
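A numeric sketch of this bound, with illustrative p and q:

```python
import math

p = [0.5, 0.25, 0.25]   # true distribution (assumed)
q = [0.25, 0.25, 0.5]   # mismatched design distribution (assumed)

H_p = -sum(pi * math.log2(pi) for pi in p)
Dpq = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
E_l = sum(pi * math.ceil(math.log2(1.0 / qi)) for pi, qi in zip(p, q))

print(H_p + Dpq <= E_l < H_p + Dpq + 1)  # True: D(p||q) is the mismatch penalty
```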
Theorem:
The codeword lengths of any uniquely decodable code must
satisfy the Kraft inequality
$$\sum_i D^{-l_i} \le 1.$$
Conversely, given a set of codeword lengths that satisfy this
inequality, it is possible to construct a uniquely decodable code
with these codeword lengths.
Proof:
Consider $C^k$, the k-th extension of the code, i.e., the code formed
by the concatenation of k repetitions of the given uniquely
decodable code C. Then
$$\left( \sum_{x \in \mathcal{X}} D^{-l(x)} \right)^{\!k} = \sum_{x_1, x_2, \ldots, x_k \in \mathcal{X}^k} D^{-l(x_1)} D^{-l(x_2)} \cdots D^{-l(x_k)} = \sum_{x^k \in \mathcal{X}^k} D^{-l(x^k)} = \sum_{m=1}^{k\, l_{\max}} a(m)\, D^{-m},$$
where $l_{\max}$ is the maximum codeword length and $a(m)$
is the number of source sequences $x^k$ mapping into
codewords of length m.
Since the code is uniquely decodable, there is at most one
sequence mapping into each code m-string, so $a(m) \le D^m$ and
$$\sum_{m=1}^{k\, l_{\max}} a(m)\, D^{-m} \le \sum_{m=1}^{k\, l_{\max}} D^m D^{-m} = k\, l_{\max}.$$
Hence
$$\sum_j D^{-l_j} \le \bigl(k\, l_{\max}\bigr)^{1/k}.$$
Since this inequality is true for all k, it is true in the limit
as $k \to \infty$. Since $(k\, l_{\max})^{1/k} \to 1$, we have
$$\sum_j D^{-l_j} \le 1,$$
which is the Kraft inequality.
Corollary:
A uniquely decodable code for an infinite source
alphabet X also satisfies the Kraft inequality.
– Any subset of a uniquely decodable code is also
uniquely decodable; hence, any finite subset of the
infinite set of codewords satisfies the Kraft
inequality. Hence,
$$\sum_{i=1}^{\infty} D^{-l_i} = \lim_{N \to \infty} \sum_{i=1}^{N} D^{-l_i} \le 1.$$
Huffman Codes
D. A. Huffman, “A method for the construction of
minimum redundancy codes," Proc. IRE, vol. 40, pp.
1098–1101, 1952.
Example: Huffman coding for a five-symbol source. [Probability table not recovered from the original slide.]
We expect the optimal binary code for X to have the
longest codewords assigned to the symbols 4 and 5.
Both these lengths must be equal, since otherwise we
can delete a bit from the longer codeword and still
have a prefix code, but with a shorter expected length.
In general, we can construct a code in which the two
longest codewords differ only in the last bit.
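The merging procedure just described is short to implement. Below is a minimal binary Huffman sketch using a min-heap; the five probabilities are illustrative stand-ins (the slide's own table was not recovered), and the exact bit assignments depend on tie-breaking:

```python
import heapq
import itertools

def huffman_code(p):
    """Binary Huffman code for {symbol: probability}: repeatedly merge the two
    least likely subtrees, prepending 0/1 to the codewords inside each."""
    tie = itertools.count()  # tie-breaker so heap entries always compare
    heap = [(prob, next(tie), {sym: ""}) for sym, prob in p.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two smallest probabilities
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

p = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}  # assumed probabilities
print(huffman_code(p))  # lengths {1: 2, 2: 2, 3: 2, 4: 3, 5: 3}; bits may vary
```

As expected, the two least likely symbols (4 and 5) receive the longest codewords, equal in length.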
Example:
If $D \ge 3$, we may not have a sufficient number of symbols so that
we can combine them D at a time. In such a case, we add
dummy symbols to the end of the set of symbols. The dummy
symbols have probability 0 and are inserted to fill the tree.
Since at each stage of the reduction the number of symbols is
reduced by D − 1, we want the total number of symbols to be
1 + k(D − 1), where k is the number of levels in the tree (a small
sketch for computing the required number of dummies follows the
table below).
Codeword   X       Probability (successive merged distributions to the right)
1          1       0.25   0.25   0.5    1.0
2          2       0.25   0.25   0.25
01         3       0.2    0.2    0.25
02         4       0.1    0.2
000        5       0.1    0.1
001        6       0.1
002        Dummy   0.0
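A tiny sketch for the dummy-symbol count implied by the 1 + k(D − 1) condition (the helper num_dummies is ours, not from the notes):

```python
def num_dummies(m, D):
    """Dummy symbols needed so that m + dummies = 1 + k(D - 1) for some integer k."""
    if m <= 1:
        return 0
    return (D - 1 - (m - 1) % (D - 1)) % (D - 1)

print(num_dummies(6, 3))  # 1 -> matches the ternary example above (6 symbols + 1 dummy)
print(num_dummies(7, 3))  # 0 -> 7 = 1 + 3*(3 - 1) already has the right form
```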
Optimality of Huffman codes
There are many optimal codes:
inverting all the bits or exchanging two codewords of
the same length will give another optimal code.
w.l.o.g., we assume $p_1 \ge p_2 \ge \cdots \ge p_m$.
Lemma:
For any distribution, there exists an optimal
instantaneous code (with minimum expected length)
that satisfies the following properties:
1. If $p_j > p_k$, then $l_j \le l_k$.
2. The two longest codewords have the same length.
3. The two longest codewords differ only in the last bit
and correspond to the two least likely symbols.
Proof:
– If $p_j > p_k$, then $l_j \le l_k$:
Consider $C'_m$, with the codewords j and k of $C_m$ interchanged.
Then
$$L(C'_m) - L(C_m) = \sum_i p_i l'_i - \sum_i p_i l_i = p_j l_k + p_k l_j - p_j l_j - p_k l_k = (p_j - p_k)(l_k - l_j).$$
But $p_j - p_k > 0$, and since $C_m$ is optimal, $L(C'_m) - L(C_m) \ge 0$.
Hence we must have $l_k \ge l_j$. Thus $C_m$ itself satisfies
property 1.
If the two longest codewords are not of the same
length, then one can delete the last bit of the longer
one, preserving the prefix property and achieving
lower expected codeword length. Hence the two
longest codewords must have the same length. By
property 1, the longest codewords must belong to the
least probable source symbols.
If there is a maximal length codeword without a sibling,
then we can delete the last bit of the codeword and
still satisfy the prefix property. This reduces the
average codeword length and contradicts the
optimality of the code. Hence every maximal length
codeword in any optimal code has a sibling.
Now we can exchange the longest codewords
so that the two lowest-probability source symbols are
associated with two siblings on the tree. This does not
change the expected length $\sum_i p_i l_i$. Thus the
codewords for the two lowest-probability source
symbols have maximal length and agree in all but the
last bit.
In summary: if $p_1 \ge p_2 \ge \cdots \ge p_m$, then there exists an optimal
code with $l_1 \le l_2 \le \cdots \le l_{m-1} = l_m$, and the codewords
$C(x_{m-1})$ and $C(x_m)$ differ only in the last bit.
For a code $C_m$ satisfying the properties of the above
lemma, we now define a "merged" code $C_{m-1}$ for m − 1
symbols as follows: take the common prefix of the two
longest codewords (corresponding to the two least
likely symbols), and allot it to a symbol with probability
$p_{m-1} + p_m$. All the other codewords remain the same:

  C_{m-1}                            C_m
  p_1:          w'_1, l'_1           w_1 = w'_1,            l_1 = l'_1
  p_2:          w'_2, l'_2           w_2 = w'_2,            l_2 = l'_2
  ...
  p_{m-2}:      w'_{m-2}, l'_{m-2}   w_{m-2} = w'_{m-2},    l_{m-2} = l'_{m-2}
  p_{m-1}+p_m:  w'_{m-1}, l'_{m-1}   w_{m-1} = w'_{m-1} 0,  l_{m-1} = l'_{m-1} + 1
                                     w_m     = w'_{m-1} 1,  l_m     = l'_{m-1} + 1
where w denotes a binary codeword and l denotes its
length. The expected length of the code $C_m$ is
$$\begin{aligned}
L(C_m) &= \sum_{i=1}^{m} p_i l_i \\
&= \sum_{i=1}^{m-2} p_i l'_i + p_{m-1}\bigl(l'_{m-1} + 1\bigr) + p_m\bigl(l'_{m-1} + 1\bigr) \\
&= \sum_{i=1}^{m-2} p_i l'_i + \bigl(p_{m-1} + p_m\bigr)\, l'_{m-1} + \bigl(p_{m-1} + p_m\bigr) \\
&= L(C_{m-1}) + p_{m-1} + p_m.
\end{aligned}$$
Thus the expected length of the code $C_m$ differs from
the expected length of $C_{m-1}$ by a fixed amount
independent of $C_{m-1}$, so minimizing the expected
length $L(C_m)$ is equivalent to minimizing $L(C_{m-1})$. Thus
we have reduced the problem to one with m − 1 symbols
and probability masses $(p_1, p_2, \ldots, p_{m-2}, p_{m-1} + p_m)$.
We again look for a code which satisfies the properties
of the lemma for these m − 1 symbols and then reduce
the problem to finding the optimal code for m − 2 symbols
with the appropriate probability masses obtained by
merging the two lowest probabilities on the previous
merged list.
Proceeding this way, we finally reduce the problem to
two symbols, for which the solution is obvious, i.e.,
allot 0 for one of the symbols and 1 for the other.
Since we have maintained optimality at every stage in
the reduction, the code constructed for m symbols is
optimal.
Theorem:
Huffman coding is optimal, i.e., if C* is the Huffman
code and C' is any other code, then $L(C^*) \le L(C')$.
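As a closing numeric check, the codeword lengths produced by the earlier Huffman sketch (for the illustrative five-symbol source) satisfy both the entropy bound and optimality against the Shannon lengths:

```python
import math

p       = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}  # assumed probabilities
lengths = {1: 2, 2: 2, 3: 2, 4: 3, 5: 3}                # from the Huffman sketch

H = -sum(q * math.log2(q) for q in p.values())
L_huff = sum(p[s] * lengths[s] for s in p)                               # 2.3
L_shannon = sum(q * math.ceil(math.log2(1.0 / q)) for q in p.values())  # 2.5

print(H <= L_huff < H + 1)  # True (approx. 2.2855 <= 2.3 < 3.2855)
print(L_huff <= L_shannon)  # True, consistent with the optimality theorem
```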