Huffman Coding
Lawrence M. Brown
25 September, 1999

• Huffman Coding
• Variable-Length Encoding
• Building a Tree
• Decoding
• Encoding
Huffman Coding
• Huffman Coding is a variable-length prefix encoding algorithm for
  compression of character streams.¹

  ¹ Shaffer, Clifford A., A Practical Introduction to Data Structures and
    Algorithm Analysis, Java Edition, Prentice Hall (1998).
Data Representation
Bits and Bytes
01001011₂ = 0×2⁷ + 1×2⁶ + 0×2⁵ + 0×2⁴ + 1×2³ + 0×2² + 1×2¹ + 1×2⁰
          = 1×64 + 1×8 + 1×2 + 1×1
          = 64₁₀ + 8₁₀ + 2₁₀ + 1₁₀
          = 75₁₀.
Longer words (16-bit, 32-bit, 64-bit) are constructed from 8-bit bytes.
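The expansion above can be checked with a short Java sketch (the method name
fromBinary is illustrative, not from the slides; Java's own
Integer.parseInt(bits, 2) performs the same positional sum):

```java
public class BinaryDemo {
    // Sum of bit × place value, as in the expansion above:
    // process bits left to right, doubling the running value each step.
    static int fromBinary(String bits) {
        int value = 0;
        for (int i = 0; i < bits.length(); i++) {
            value = value * 2 + (bits.charAt(i) - '0'); // shift left, add next bit
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(fromBinary("01001011")); // prints 75
    }
}
```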
Variable-length Encoding
• Unicode and ASCII are fixed-length encoding schemes: every character
  requires the same amount of storage (16 bits and 8 bits, respectively).
• A variable-length encoding instead assigns short codes to frequently
  occurring characters and longer codes to rare ones, reducing the total
  storage required.
Building a Tree
• First, the frequency of each character in the document is tallied, and
  the characters are placed in a priority queue ordered by frequency.
• Next, a Binary Tree is built in which the external nodes store the
  characters and the corresponding character frequencies observed in
  the document.
Character:   C    D    E    F    K    L    U    Z
Frequency:  32   42  120   24    7   42   37    2

• Step 1: Insert the characters into a priority queue, ordered by
  ascending frequency:

    2   7   24   32   37   42   42   120
    Z   K    F    C    U    L    D     E
• A new Binary Tree is created with the lowest-key Item as the left
external node, and the second lowest-key Item as the right external
node.
• The new Tree is then inserted back into the priority queue.
• Step 2: Remove the two lowest-frequency items (Z and K) and merge them
  into a subtree of weight 2 + 7 = 9:

      9   24   32   37   42   42   120
     / \   F    C    U    L    D     E
    2   7
    Z   K
• Step 3: Merge the 9-subtree with F (24) into a subtree of weight 33:

    32     33    37   42   42   120
     C    /  \    U    L    D     E
         9    24
        / \    F
       2   7
       Z   K

• Step 4: Merge C (32) with the 33-subtree into a subtree of weight 65:

    37   42   42     65    120
     U    L    D    /  \     E
                  32    33
                   C   /  \
                      9    24
                     / \    F
                    2   7
                    Z   K
• Step 5: Merge U (37) and L (42) into a subtree of weight 79:

    42     65        79    120
     D    /  \      /  \     E
        32    33  37    42
         C   /  \  U     L
            9    24
           / \    F
          2   7
          Z   K
• The remaining merges (D + 65 = 107, then 79 + 107 = 186, then
  120 + 186 = 306) produce the final tree; the root weight, 306, is the
  total character count:

                306
               /   \
            120     186
              E    /   \
                 79     107
                /  \   /   \
              37   42 42    65
               U    L  D   /  \
                         32    33
                          C   /  \
                             9    24
                            / \    F
                           2   7
                           Z   K
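The merge steps above can be sketched with java.util.PriorityQueue (a minimal
sketch; the Node class and names are illustrative, not from the slides — note
that ties between equal weights may be broken differently, giving a different
but equal-cost tree):

```java
import java.util.PriorityQueue;

public class HuffmanBuild {
    static class Node implements Comparable<Node> {
        final int weight;
        final Character ch;      // null for internal nodes
        final Node left, right;
        Node(char ch, int weight) {
            this.ch = ch; this.weight = weight; this.left = null; this.right = null;
        }
        Node(Node left, Node right) {
            this.ch = null; this.left = left; this.right = right;
            this.weight = left.weight + right.weight; // merged subtree weight
        }
        public int compareTo(Node other) { return Integer.compare(weight, other.weight); }
    }

    // Repeatedly remove the two lowest-weight trees and merge them,
    // lowest as the left child, until a single tree remains.
    static Node build(char[] chars, int[] freqs) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (int i = 0; i < chars.length; i++) queue.add(new Node(chars[i], freqs[i]));
        while (queue.size() > 1) {
            Node left = queue.poll();
            Node right = queue.poll();
            queue.add(new Node(left, right));
        }
        return queue.poll();
    }

    public static void main(String[] args) {
        char[] chars = {'C', 'D', 'E', 'F', 'K', 'L', 'U', 'Z'};
        int[] freqs  = { 32,  42, 120,  24,   7,  42,  37,   2};
        Node root = build(chars, freqs);
        System.out.println(root.weight); // prints 306
    }
}
```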
Decoding
• To decode a bit stream (reading from the leftmost bit), start at the root
  node of the Tree:
   • move to the left child if the bit is a “0”;
   • move to the right child if the bit is a “1”.
• When an external node is reached, the character at that node is appended
  to the decoded string, and decoding of the next bit resumes at the root.
• Example: decode the bit stream 1011001110111101. Label each left edge of
  the final tree “0” and each right edge “1”; then, reading from the root:

      101  1001110111101   →  L
      100  1110111101      →  L U
      1110 111101          →  L U C
      111101               →  L U C K

  The stream decodes to “LUCK”.
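A sketch of decoding, assuming the codes read off the tree above are already
available as a table (names are illustrative). Instead of walking the tree
node by node as described, it matches code prefixes directly against the bit
stream, which is equivalent because Huffman codes are prefix-free:

```java
public class HuffmanDecode {
    // Decode a bit string by repeatedly matching one code at the current
    // position; prefix-freeness guarantees at most one code can match.
    static String decode(String bits, String[] codes, char[] chars) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < bits.length()) {
            for (int i = 0; i < codes.length; i++) {
                if (bits.startsWith(codes[i], pos)) {
                    out.append(chars[i]);      // emit the matched character
                    pos += codes[i].length();  // consume its bits
                    break;
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Codes read off the final tree (C, D, E, F, K, L, U, Z).
        char[] chars   = {'C', 'D', 'E', 'F', 'K', 'L', 'U', 'Z'};
        String[] codes = {"1110", "110", "0", "11111", "111101", "101", "100", "111100"};
        System.out.println(decode("1011001110111101", codes, chars)); // prints LUCK
    }
}
```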
Encoding
• Create a lookup table storing the binary code corresponding to the path
to each letter.
• If encoding ASCII text, a 128-element array would suffice:

  String[] encoder = new String[128];
  encoder['C'] = "1110";
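Such a lookup table can be filled by walking the tree, appending “0” for each
left edge and “1” for each right edge (a minimal sketch; the Node class and
method names are illustrative, not from the slides):

```java
import java.util.HashMap;
import java.util.Map;

public class HuffmanCodes {
    // Minimal tree node: leaves carry a character, internal nodes carry children.
    static class Node {
        final Character ch;      // null for internal nodes
        final Node left, right;
        Node(char ch) { this.ch = ch; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.ch = null; this.left = left; this.right = right; }
    }

    // Walk the tree; the path of 0s and 1s to each leaf is that leaf's code.
    static void fillTable(Node node, String path, Map<Character, String> table) {
        if (node.ch != null) {           // external node: record its code
            table.put(node.ch, path);
            return;
        }
        fillTable(node.left, path + "0", table);
        fillTable(node.right, path + "1", table);
    }

    // Rebuild the example tree from the slides (E at depth 1, Z/K deepest).
    static Node exampleTree() {
        Node zk  = new Node(new Node('Z'), new Node('K'));   // weight 9
        Node zkf = new Node(zk, new Node('F'));              // weight 33
        Node c33 = new Node(new Node('C'), zkf);             // weight 65
        Node ul  = new Node(new Node('U'), new Node('L'));   // weight 79
        Node d65 = new Node(new Node('D'), c33);             // weight 107
        return new Node(new Node('E'), new Node(ul, d65));   // weight 306
    }

    public static void main(String[] args) {
        Map<Character, String> table = new HashMap<>();
        fillTable(exampleTree(), "", table);
        System.out.println(table.get('C')); // prints 1110
    }
}
```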
Analysis
• Define fᵢ = frequency of letter lᵢ, i = 1, …, n.
• Define cᵢ = cost of each letter lᵢ (number of bits in its code).
• Expected cost per character:

    ECPC = (∑ᵢ₌₁ⁿ cᵢfᵢ) / (∑ᵢ₌₁ⁿ fᵢ)   bits/character.
    Letter   Frequency   Code     Bits
      C          32      1110      4
      D          42      110       3
      E         120      0         1
      F          24      11111     5
      K           7      111101    6
      L          42      101       3
      U          37      100       3
      Z           2      111100    6

• Total characters: 32 + 42 + 120 + 24 + 7 + 42 + 37 + 2 = 306.
• ECPC = (32×4 + 42×3 + 120×1 + 24×5 + 7×6 + 42×3 + 37×3 + 2×6) / 306
       = 785 / 306
       ≈ 2.57 bits/character.
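The arithmetic above can be reproduced directly (frequencies and code lengths
taken from the table; the method name ecpc is illustrative):

```java
public class ExpectedCost {
    // Weighted average code length: (sum of freq × bits) / (sum of freq).
    static double ecpc(int[] freq, int[] bits) {
        int totalBits = 0, totalChars = 0;
        for (int i = 0; i < freq.length; i++) {
            totalBits  += freq[i] * bits[i]; // bits for all occurrences of letter i
            totalChars += freq[i];
        }
        return (double) totalBits / totalChars;
    }

    public static void main(String[] args) {
        // Frequencies and code lengths for C, D, E, F, K, L, U, Z.
        int[] freq = {32, 42, 120, 24, 7, 42, 37, 2};
        int[] bits = { 4,  3,   1,  5, 6,  3,  3, 6};
        System.out.printf("ECPC = %.2f bits/character%n", ecpc(freq, bits));
        // prints ECPC = 2.57 bits/character
    }
}
```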
Summary
• Huffman codes are variable-length codes based on the observed frequency
  of characters in a document.
• The space savings from Huffman Coding are greatest when the variation in
  the character frequencies is large.