Uniquely Decodable Codes
Reading:
1. Read Chapter 5.
We are now ready to use the tools we have been building over the last few weeks
to work on the problem of efficient representation of data: data compression. In
order to make usable coding representations, we introduce a type of code known as
instantaneous codes, which can be decoded without any backtracking. We present
the Kraft inequality, which is an important result on the lengths of codewords.
Then we show that the entropy is a lower bound on the average codeword length,
show how closely that bound can be approached, and introduce Huffman coding.
X    non-singular, but not    uniquely decodable, but    instantaneous
     uniquely decodable       not instantaneous
1    0                        10                         0
2    010                      00                         10
3    01                       11                         110
4    10                       110                        111
Take the uniquely-decodable but non-instantaneous code: if the first two bits
are 11, then we must look at following bits. If the next bit is a 1 then the first
symbol is 3. If the length of the string of 0s following the 11 is odd, then the first
codeword must be 110 and the first source symbol must be 4. If the length of the
string of 0s is even, the first source symbol must be 3.
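As a small illustration, here is a Python sketch (the code tables are the ones above; the test strings are arbitrary choices of mine) that enumerates all parses of a bit string: the non-singular code admits several parses of the same string, while the instantaneous code admits exactly one.

def all_parses(s, code):
    # Return every way of splitting bit string s into codewords of `code`
    # (code maps source symbol -> codeword).
    if s == "":
        return [[]]
    parses = []
    for sym, cw in code.items():
        if s.startswith(cw):
            for rest in all_parses(s[len(cw):], code):
                parses.append([sym] + rest)
    return parses

non_singular = {1: "0", 2: "010", 3: "01", 4: "10"}
instantaneous = {1: "0", 2: "10", 3: "110", 4: "111"}

print(all_parses("010", non_singular))      # three parses: not uniquely decodable
print(all_parses("110010", instantaneous))  # exactly one parse: [[3, 1, 2]]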
Kraft Inequality
Theorem (Kraft inequality) For any instantaneous code (prefix code) over an
alphabet of size $D$, the codeword lengths $l_1, l_2, \ldots, l_m$ must satisfy
$$\sum_{i=1}^{m} D^{-l_i} \le 1.$$
Conversely, given a set of codeword lengths that satisfy this inequality there exists
an instantaneous code with these word lengths.
Proof Consider a D-ary tree representing the codewords: the path down the tree
is the sequence of symbols, and each leaf of the tree (with its unique associated
path) corresponds to a codeword. The prefix condition implies that no codeword
is an ancestor of any other codeword on the tree: each codeword eliminates its
descendants as possible codewords.
Let $l_{\max}$ be the length of the longest codeword. Of all the possible nodes at
level $l_{\max}$, some may be codewords, some may be descendants of codewords, and
some may be neither. A codeword (leaf node) at level $l_i$ has $D^{l_{\max}-l_i}$ descendants
at level $l_{\max}$. Each of the descendant sets must be disjoint (because of the tree
structure). The total number of possible leaf nodes at level $l_{\max}$ is $D^{l_{\max}}$. Hence,
summing over all codewords,
$$\sum_{\text{all codewords}} D^{l_{\max}-l_i} \le D^{l_{\max}}.$$
That is,
$$\sum_i D^{-l_i} \le 1.$$
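A brief sketch in Python for the binary case (function names are my own) of both directions: checking the Kraft sum for a list of lengths, and building a prefix code from lengths that satisfy the inequality by assigning codewords in order of increasing length (the canonical-code construction).

def kraft_sum(lengths, D=2):
    # sum_i D^{-l_i}; the Kraft inequality requires this to be <= 1
    return sum(D ** (-l) for l in lengths)

def canonical_prefix_code(lengths):
    # Given binary codeword lengths satisfying Kraft, assign codewords in
    # order of increasing length; the result is a prefix (instantaneous) code.
    assert kraft_sum(lengths) <= 1.0
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    code, prev_len = 0, None
    for i in order:
        if prev_len is not None:
            code = (code + 1) << (lengths[i] - prev_len)
        codewords[i] = format(code, "0{}b".format(lengths[i]))
        prev_len = lengths[i]
    return codewords

print(kraft_sum([1, 2, 3, 3]))              # 1.0: satisfies Kraft with equality
print(canonical_prefix_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']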
Optimal codes
We deem a code to be optimal if it has the shortest average codeword length. The
goal, after all, is to use the smallest number of bits to send the information. This
may be regarded as an optimization problem. In designing the code, we must select
the codeword lengths $l_1, l_2, \ldots, l_m$ so that the average length
$$L = \sum_i p_i l_i$$
is as short as possible (no longer than that of any other prefix code), subject to the
constraint that the lengths satisfy the Kraft inequality (so that it will be a prefix
code). That is, we
minimize
$$L = \sum_i p_i l_i$$
subject to
$$\sum_i D^{-l_i} \le 1.$$
We will make two simplifying assumptions to get started: (1) we will neglect integer
constraints on the codelengths; and (2) we will assume Kraft holds with equality.
Then we can write a Lagrange-multiplier problem
$$J = \sum_i p_i l_i + \lambda \sum_i D^{-l_i}.$$
Differentiating with respect to $l_j$ and setting the result to zero,
$$\frac{\partial J}{\partial l_j} = p_j - \lambda D^{-l_j} \log D = 0,$$
so that
$$D^{-l_j} = \frac{p_j}{\lambda \log D}.$$
Substituting into the constraint $\sum_j D^{-l_j} = 1$ gives
$$\sum_j \frac{p_j}{\lambda \log D} = \frac{1}{\lambda \log D} = 1,$$
so $\lambda = 1/\log D$, and
$$p_i = D^{-l_i^*},$$
and the optimal codeword lengths are $l_i^* = -\log_D p_i$. (The ${}^*$ denotes the
optimal value.) Under this solution, the minimal average codeword length is
$$L^* = \sum_i p_i l_i^* = H_D(X).$$
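As a quick numerical check (a Python sketch; the distribution is just an example of mine), the non-integer optimal lengths $l_i^* = -\log_D p_i$ give an average length equal to the entropy.

import math

p = [0.5, 0.25, 0.125, 0.125]                 # example distribution
opt_lengths = [-math.log2(pi) for pi in p]    # l_i* = -log_2 p_i (D = 2)
L_star = sum(pi * li for pi, li in zip(p, opt_lengths))
H = -sum(pi * math.log2(pi) for pi in p)
print(opt_lengths)                            # [1.0, 2.0, 3.0, 3.0]
print(L_star, H)                              # both equal 1.75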
More generally, the average length $L$ of any instantaneous $D$-ary code satisfies
$L \ge H_D(X)$. To see this, write
$$L - H_D(X) = \sum_i p_i l_i - \sum_i p_i \log_D \frac{1}{p_i}
= -\sum_i p_i \log_D D^{-l_i} + \sum_i p_i \log_D p_i.$$
Now let $r_i = D^{-l_i}/\sum_j D^{-l_j}$ and $c = \sum_i D^{-l_i}$; we have
$$L - H_D(X) = \sum_i p_i \log_D \frac{p_i}{r_i} - \log_D c
= D(p \| r) + \log_D \frac{1}{c} \ge 0,$$
since the relative entropy is non-negative and, by the Kraft inequality, $c \le 1$.
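The identity in the proof is easy to verify numerically (a sketch; the distribution and lengths below are arbitrary examples of mine that satisfy the Kraft inequality).

import math

p = [0.4, 0.3, 0.2, 0.1]                     # example distribution
l = [1, 2, 3, 4]                             # example integer lengths satisfying Kraft
D = 2

c = sum(D ** (-li) for li in l)              # Kraft sum, here 0.9375 <= 1
r = [D ** (-li) / c for li in l]

L = sum(pi * li for pi, li in zip(p, l))
H = -sum(pi * math.log(pi, D) for pi in p)
Dpr = sum(pi * math.log(pi / ri, D) for pi, ri in zip(p, r))

print(L - H)                                 # the two printed values agree
print(Dpr + math.log(1 / c, D))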
The theorem just proved shows that the average length can be no less than $H_D(X)$.
We can now show that for a physically implementable instantaneous code (that is,
a code with integer codeword lengths), we can find an upper bound on the code
length:
$$H_D(X) \le L < H_D(X) + 1.$$
That is, the overhead due to the integer codeword lengths is not more than one bit.
The codeword lengths are found by
$$l_i = \left\lceil \log_D \frac{1}{p_i} \right\rceil,$$
where $\lceil x \rceil$ is the smallest integer greater than or equal to $x$. These codeword
lengths satisfy the Kraft inequality:
$$\sum_i D^{-l_i} = \sum_i D^{-\lceil \log_D (1/p_i) \rceil}
\le \sum_i D^{-\log_D (1/p_i)} = \sum_i p_i = 1.$$
These lengths also satisfy
$$\log_D \frac{1}{p_i} \le l_i < \log_D \frac{1}{p_i} + 1.$$
Multiplying by $p_i$ and summing over $i$ gives $H_D(X) \le L < H_D(X) + 1$.
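A short sketch of this construction in Python (the distribution is an example of mine): the lengths $\lceil \log_2(1/p_i) \rceil$ satisfy the Kraft inequality, and the resulting average length falls between the entropy and the entropy plus one bit.

import math

def shannon_lengths(p):
    # binary codeword lengths l_i = ceil(log2(1/p_i))
    return [math.ceil(math.log2(1 / pi)) for pi in p]

p = [0.25, 0.25, 0.2, 0.15, 0.15]            # example distribution
l = shannon_lengths(p)

kraft = sum(2 ** (-li) for li in l)
L = sum(pi * li for pi, li in zip(p, l))
H = -sum(pi * math.log2(pi) for pi in p)

print(l, kraft)                              # [2, 2, 3, 3, 3], Kraft sum 0.875
print(H, L, H + 1)                           # H <= L < H + 1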
The overhead of up to one bit per symbol can be reduced by encoding blocks of $n$
source symbols at a time. For i.i.d. symbols $X_1, X_2, \ldots, X_n$, the expected code
length per source symbol is
$$L_n = \frac{1}{n} \sum_{x^n \in \mathcal{X}^n} p(x_1, x_2, \ldots, x_n)\, l(x_1, x_2, \ldots, x_n)
= \frac{1}{n} E\, l(X_1, X_2, \ldots, X_n).$$
Applying the bounds above to the block, and using $H(X_1, X_2, \ldots, X_n) = n H(X)$
for i.i.d. symbols, we obtain
$$H(X) \le L_n < H(X) + \frac{1}{n}.$$
By choosing the block size sufficiently large, the average code length can be made
arbitrarily close to the entropy.
The next observation is that if the symbols are not independent, we can still
write
$$H(X_1, X_2, \ldots, X_n) \le E\, l(X_1, X_2, \ldots, X_n) < H(X_1, X_2, \ldots, X_n) + 1.$$
Dividing through by $n$ we obtain
$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le L_n < \frac{H(X_1, X_2, \ldots, X_n)}{n} + \frac{1}{n}.$$
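To see the effect of block coding numerically, here is a sketch (Python; the i.i.d. binary source distribution is an arbitrary example of mine) that applies the lengths $\lceil \log_2(1/p) \rceil$ to blocks of increasing size.

import math
from itertools import product

p = {0: 0.9, 1: 0.1}                         # example i.i.d. binary source
H = -sum(q * math.log2(q) for q in p.values())

for n in range(1, 6):
    # expected per-symbol length using Shannon lengths on blocks of n symbols
    Ln = 0.0
    for block in product(p, repeat=n):
        prob = math.prod(p[s] for s in block)
        Ln += prob * math.ceil(math.log2(1 / prob))
    print(n, Ln / n)                         # stays within 1/n of the entropy

print("entropy:", H)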
The set of instantaneous codes is smaller than the set of uniquely decodable codes,
so we might think that we could obtain a lower average codeword length L for
uniquely decodable codes. However, the point of this section is that this is not the
case. Hence, we may as well just use instantaneous codes, since they are easier to
decode.
Huffman codes
Huffman codes are the optimal prefix codes for a given distribution. What's more,
if we know the distribution, Huffman codes are easy to find. The code operates
from the premise of assigning longer codewords to less-likely symbols, and doing it
in a tree-structured way so that the codes obtained are prefix-free.
Example 6 Consider the distribution X taking values in the set X = {1, 2, 3, 4, 5}
with probabilities .25, .25, .2, .15, .15, respectively.
At each stage of the development, we combine the two least-probable symbols
(in this case, for a binary code) into one symbol.
1. At the first round, then, the .15 and .15 are combined to form a symbol with
probability .3. The set of probabilities (in ordered form) is .3, .25, .25, .2.
2. Now combine the lowest two probabilities: .2 + .25 = .45. The ordered list of
probabilities is .45, .3, .25.
3. Combine the two lowest probabilities: .25 + .3 = .55. The ordered list of
probabilities is .55, .45.
4. Combine these to obtain the total probability of 1. We are done!
Now assign codewords on the tree. The average codelength is 2.3 bits.
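A compact sketch of the procedure in Python (the implementation details, such as the use of a heap, are my own); applied to this example it reproduces the 2.3-bit average length.

import heapq

def huffman_code(probs):
    # Binary Huffman code: repeatedly merge the two least-probable nodes.
    # probs maps symbol -> probability; returns symbol -> codeword.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)                        # tie-breaker so tuples always compare
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

probs = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}     # Example 6
code = huffman_code(probs)
print(code)
print(sum(probs[s] * len(code[s]) for s in probs))       # approximately 2.3 bits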
Codes with more than D = 2 symbols can also be built, as described in the
book.
Proving the optimality of the Huffman code begins with the following simple
lemma:
Lemma 1 For any distribution, there exists an optimal instantaneous code (of
shortest average length) that satisfies the following properties:
1. If $p_j > p_k$, then $l_j \le l_k$.
2. The two longest codewords have the same length.
3. The two longest codewords differ only in the last bit, and correspond to the
two least-likely symbols.
Proof (Sketch)
1. Simply swap lengths.
2. If the two longest are not of the same length, trim the longer: still a prefix
code.
3. If not siblings on the tree (i.e., do not differ just in one bit), then we can remove
a bit from the longest code, which contradicts the optimality property.
For a code on $m$ symbols, assume (w.l.o.g.) that the probabilities are ordered
$p_1 \ge p_2 \ge \cdots \ge p_m$. Define the merged code $C_{m-1}$ on $m-1$ symbols by merging
the two least-probable symbols $p_{m-1}, p_m$. The codeword for this merged symbol is
the common prefix of the two least-probable (longest) codewords, which, by the
lemma, exists. Let $p'_i$ and $l'_i$ denote the probabilities and codeword lengths of
the merged code, so that $p'_{m-1} = p_{m-1} + p_m$ and $l'_{m-1} = l_m - 1$. The expected
length of the code $C_m$ is
$$\begin{aligned}
L(C_m) &= \sum_{i=1}^{m} p_i l_i \\
&= \sum_{i=1}^{m-2} p_i l_i + (p_{m-1} + p_m) l_m \\
&= \sum_{i=1}^{m-2} p_i l_i + p'_{m-1}(l'_{m-1} + 1) \\
&= \sum_{i=1}^{m-1} p'_i l'_i + p'_{m-1} \\
&= L(C_{m-1}) + p_{m-1} + p_m.
\end{aligned}$$
The optimization problem on $m$ symbols has been reduced to an optimization
problem on $m-1$ symbols. Proceeding inductively, we get down to two symbols,
for which the optimal code is obvious: 0 or 1.
Arithmetic coding
Go over idea of arithmetic coding. Start with p0 = .75, p1 = .25 and explain
procedure for encoding and decoding. Then take case of p0 = 0.7 and p1 = 0.3.
Generalize to multiple (more than 2) symbols.
Discuss numerical problems.
Handouts.
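A sketch of the encoding and decoding procedure for a binary memoryless source (Python; this uses idealized floating-point arithmetic and so ignores the numerical problems mentioned above; p0 = .75 as in the first case, and the test message is an arbitrary choice of mine).

def encode(bits, p0):
    # shrink the interval [low, high) according to the symbol probabilities
    low, high = 0.0, 1.0
    for b in bits:
        split = low + p0 * (high - low)
        if b == 0:
            high = split                     # symbol 0 takes the lower subinterval
        else:
            low = split                      # symbol 1 takes the upper subinterval
    return (low + high) / 2                  # any number in the final interval

def decode(x, p0, n):
    # retrace the encoder's interval splits to recover n symbols
    low, high = 0.0, 1.0
    out = []
    for _ in range(n):
        split = low + p0 * (high - low)
        if x < split:
            out.append(0)
            high = split
        else:
            out.append(1)
            low = split
    return out

msg = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0]
x = encode(msg, p0=0.75)
print(x, decode(x, 0.75, len(msg)) == msg)   # True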
Run-length coding
Suppose you want to encode the information on a regular fax machine. There are
two possible outputs, white or black. Most paper consists of a white background
with black lettering, so the proportion of white pixels tends to be quite large.
However, when black appears, it often appears as a run.
One potential way of doing data compression on such a source is by means of
run-length coding: each run of identical symbols is encoded as a count of its length.
We have to worry about how long the runs can be, so there is a maximum allowable
run length.
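A minimal sketch of run-length encoding in Python (the maximum run length of 63 and the sample line are my own choices):

def run_length_encode(bits, max_run=63):
    # encode a binary sequence as (symbol, run length) pairs, capping each run
    runs = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i] and j - i < max_run:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    return runs

line = [0] * 40 + [1] * 5 + [0] * 20 + [1] * 3    # mostly white (0) with runs of black (1)
print(run_length_encode(line))                    # [(0, 40), (1, 5), (0, 20), (1, 3)]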
What fax machines actually do is run-length coding followed by Huffman coding. This is called Modified Huffman Coding.
Lempel-Ziv Coding
Lempel-Ziv coding (and its variants) is a very common data compression scheme.
It is based upon the source building up a dictionary of previously-seen strings, and
transmitting only the innovations while creating new strings.
For the first example, we start off with an empty dictionary, and assume that
we know (somehow!) that the dictionary will not contain more than 8 symbols.
Suppose we are given the string
1011010100010...
The source stream is parsed until the shortest string is found that has not
been seen before. Since this is the shortest such string, all of its prefixes
must have been sent before. The string can be coded by sending the index from
the dictionary of the prefix string and the new bit. This string is then added to the
dictionary.
To illustrate, 1 has not been seen before, so we send the index of its prefix (set
at 000), then the number 1. We add the sequence 1 to the dictionary. Then 0 has
not been seen before, so we send the index of its prefix (000) and the number 0.
We add the sequence 0 to the dictionary. The sequence 11 has a prefix string of 1,
so we send its index, and the number 1. Proceeding this way, the dictionary looks
like this:
The Lempel-Ziv string dictionary

index   contents
000     null
001     1
010     0
011     11
100     01
101     010
110     00
111     10
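A sketch of the parsing procedure in Python (helper names are mine); running it on the example string reproduces the dictionary above and the (index, bit) pairs that are transmitted.

def lz_parse(bits, index_bits=3):
    # Parse into phrases: each new phrase is a previously seen phrase plus one
    # new bit; emit (prefix index, new bit) and add the phrase to the dictionary.
    # Index 000 is the empty (null) string; a trailing incomplete phrase is ignored.
    dictionary = {"": 0}
    output = []
    phrase = ""
    for b in bits:
        if phrase + b in dictionary:
            phrase += b
        else:
            output.append((format(dictionary[phrase], "0{}b".format(index_bits)), b))
            dictionary[phrase + b] = len(dictionary)
            phrase = ""
    return output, dictionary

out, d = lz_parse("1011010100010")
print(out)   # [('000','1'), ('000','0'), ('001','1'), ('010','1'), ('100','0'), ('010','0'), ('001','0')]
print(d)     # {'': 0, '1': 1, '0': 2, '11': 3, '01': 4, '010': 5, '00': 6, '10': 7}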
It can be shown that the Lempel-Ziv code is asymptotically optimal: for a stationary
ergodic source,
$$\frac{1}{n}\, l(X_1, X_2, \ldots, X_n) \to H(\mathcal{X})$$
with probability 1, where $H(\mathcal{X})$ is the entropy rate of the source.
Adaptive Huffman coding
The Huffman coding algorithm works when the source is stationary and the probabilities are known. In the circumstance in which the source is non-stationary or
the probabilities are not known in advance, the adaptive Huffman coding algorithm
is a possibility. In this case, the relative probabilities of the symbols are estimated
by keeping counts of the occurrences of the source symbols. When the counts reach
a point that the tree is no longer optimal, the tree is rearranged to provide a new
Huffman code. Since the same updates can be carried out simultaneously at both
the transmitter and the receiver, decoding can take place. A paper will be handed
out describing the technique. (This might make a good paper project.) It is believed
(by me) that the method is flawed.