Huffman Coding†
Lawrence M. Brown
• Huffman Coding
• Variable Length Encoding
• Building a Tree
• Decoding
• Encoding
†Adapted from: http://www.apl.jhu.edu/~caj/
25 September, 1999
1
Huffman Coding
Lawrence M. Brown
Huffman Coding
• Huffman Coding is a variable-length prefix encoding algorithm for compression of character streams.
• Codes are assigned to characters such that the length of the code depends on the relative frequency of the corresponding character.

Examples:
• File compression:
    • JPEG images.
    • MPEG movies.
• Transmission of data over band-limited channels:
    • Modem data compression.

Frequency of occurrence per 1000 letters.¹

Letter  Frequency    Letter  Frequency
A       77           N       67
B       17           O       67
C       32           P       20
D       42           Q       5
E       120          R       59
F       24           S       67
G       17           T       85
H       50           U       37
I       76           V       12
J       4            W       22
K       7            X       4
L       42           Y       22
M       24           Z       2

¹Shaffer, Clifford A., A Practical Introduction to Data Structures and Algorithm Analysis, Java Edition, Prentice Hall (1998).
Data Representation
Bits and Bytes
• Digital computers store data in binary or base-2 format.
• A binary digit (bit) is represented by a 0 or 1.
• A byte is an 8-bit number and is typically the smallest size of a binary number represented on a computer.

\[
\begin{aligned}
01001011_2 &= 0 \times 2^7 + 1 \times 2^6 + 0 \times 2^5 + 0 \times 2^4 + 1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 \\
&= 1 \times 64 + 1 \times 8 + 1 \times 2 + 1 \times 1 \\
&= 64_{10} + 8_{10} + 2_{10} + 1_{10} \\
&= 75_{10}.
\end{aligned}
\]
Longer words (16-bit, 32-bit, 64-bit) are constructed from 8-bit bytes.
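
The positional expansion above can be checked with a few lines of Java. This is only an illustrative sketch: the class name BinaryDemo and the method binaryToDecimal are hypothetical names, not part of the original slides. Integer.parseInt with radix 2 is the standard-library equivalent.

```java
class BinaryDemo {
    // Expands a string of '0'/'1' digits into its base-10 value,
    // processing the most significant bit first.
    static int binaryToDecimal(String bits) {
        int value = 0;
        for (int i = 0; i < bits.length(); i++) {
            value = value * 2 + (bits.charAt(i) - '0');
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(binaryToDecimal("01001011"));      // 75
        System.out.println(Integer.parseInt("01001011", 2));  // 75
    }
}
```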
Unicode and ASCII

• Unicode is an international standard that defines a universal character set; Java's char holds one Unicode value as a 16-bit unsigned integer.
• These 16-bit values range from 0 to 65,535 (\u0000 to \uFFFF) and cover characters from many languages (English, Russian, Asian, etc.).
• Java stores characters (char) as Unicode.
• The standard ASCII characters keep their values in Unicode: the full set spans 0 to 127 (\u0000 to \u007F), with the printable characters at 32 to 126 (\u0020 to \u007E).
• ASCII characters represent the lowest 7 bits of the Unicode set, with the upper 9 bits set to zero.

ASCII Character Set

       0    1    2    3    4    5    6    7
  0    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL
  8    BS   HT   NL   VT   NP   CR   SO   SI
 16    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB
 24    CAN  EM   SUB  ESC  FS   GS   RS   US
 32    SP   !    "    #    $    %    &    '
 40    (    )    *    +    ,    -    .    /
 48    0    1    2    3    4    5    6    7
 56    8    9    :    ;    <    =    >    ?
 64    @    A    B    C    D    E    F    G
 72    H    I    J    K    L    M    N    O
 80    P    Q    R    S    T    U    V    W
 88    X    Y    Z    [    \    ]    ^    _
 96    `    a    b    c    d    e    f    g
104    h    i    j    k    l    m    n    o
112    p    q    r    s    t    u    v    w
120    x    y    z    {    |    }    ~    DEL
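
A short Java sketch can make the relationship between char values and the 7-bit ASCII range concrete. The class CharCodes and the helper isAscii are hypothetical names chosen for illustration; the mask 0xFF80 simply selects the upper 9 bits of a 16-bit char.

```java
class CharCodes {
    // Returns true when the character fits in the 7-bit ASCII range,
    // i.e. the upper 9 bits of the 16-bit char are all zero.
    static boolean isAscii(char c) {
        return (c & 0xFF80) == 0;
    }

    public static void main(String[] args) {
        char letter = 'A';
        System.out.println((int) letter);        // 65
        System.out.println(isAscii(letter));     // true
        System.out.println(isAscii('\u0416'));   // false (a Cyrillic letter)
    }
}
```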
Variable-length Encoding
• Unicode and ASCII are fixed-length encoding schemes. All characters require the same amount of storage (16 bits and 8 bits, respectively).
• Huffman coding is a variable-length encoding scheme. The number of bits required to store a coded character varies according to the relative frequency or weight of the character.
• Significant space savings are achieved for frequently used characters (requiring only one, two or three bits).
• Little space saving is achieved for infrequent characters.
Letter Frequency Huffman Code

E 120
I 10
2
Huffman Coding Tree
• A Huffman Coding Tree is built from the observed frequencies of characters in a document.
• The document is scanned and the number of occurrences of each character is recorded.
• Next, a Binary Tree is built in which the external nodes store the character and the corresponding character frequency observed in the document.

Often, pre-scanning a document and generating a custom Huffman Coding Tree is impractical. In that case, typical frequencies are used instead of the specific frequencies from a particular document.
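
The scanning pass described above might look like the following sketch. The class FrequencyCount and the method countFrequencies are hypothetical names, and a TreeMap is assumed only so that the printed result has a predictable order.

```java
import java.util.Map;
import java.util.TreeMap;

class FrequencyCount {
    // Scans the text once and records how often each character occurs.
    static Map<Character, Integer> countFrequencies(String text) {
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(countFrequencies("DEED"));  // {D=2, E=2}
    }
}
```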
Building a Huffman Coding Tree

• Consider the observed frequency of characters in a string that requires encoding:

Character   C    D    E    F    K    L    U    Z
Frequency   32   42   120  24   7    42   37   2

• The first step is to construct a Priority Queue and insert each frequency-character (key-element) pair into the queue.
• Step 1:

2    7    24   32   37   42   42   120
Z    K    F    C    U    L    D    E

Sorted, sequence-based, priority queue.
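
Step 1 can be sketched with java.util.PriorityQueue. Note the assumptions: that class is heap-based rather than the sorted, sequence-based queue pictured on the slide, and the order in which it returns equal keys (L and D, both 42) is unspecified. The names QueueStep, Item and removalOrder are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class QueueStep {
    // A frequency-character (key-element) pair.
    static final class Item {
        final int frequency;
        final char symbol;

        Item(int frequency, char symbol) {
            this.frequency = frequency;
            this.symbol = symbol;
        }
    }

    // Inserts every pair into a min-priority queue keyed on frequency, then
    // removes them one by one to expose the lowest-key-first order.
    static List<Item> removalOrder(char[] symbols, int[] frequencies) {
        PriorityQueue<Item> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.frequency, b.frequency));
        for (int i = 0; i < symbols.length; i++) {
            queue.add(new Item(frequencies[i], symbols[i]));
        }
        List<Item> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll());
        }
        return order;
    }

    public static void main(String[] args) {
        char[] symbols = {'C', 'D', 'E', 'F', 'K', 'L', 'U', 'Z'};
        int[] frequencies = {32, 42, 120, 24, 7, 42, 37, 2};
        StringBuilder keys = new StringBuilder();
        for (Item item : removalOrder(symbols, frequencies)) {
            keys.append(item.frequency).append(' ');
        }
        System.out.println(keys.toString().trim());  // 2 7 24 32 37 42 42 120
    }
}
```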
• In the second step, the two Items with the lowest key values are removed from the priority queue.
• A new Binary Tree is created with the lowest-key Item as the left external node, and the second lowest-key Item as the right external node.
• The new Tree is then inserted back into the priority queue, with a key equal to the sum of the two removed keys (2 + 7 = 9).
• Step 2:

  9     24   32   37   42   42   120
 / \    F    C    U    L    D    E
2   7
Z   K
• The process is continued until only one node (the Binary Tree) is left in the priority queue.
• Step 3:

32      33      37   42   42   120
C      /  \     U    L    D    E
      9    24
     / \   F
    2   7
    Z   K

• Step 4:

37   42   42       65       120
U    L    D       /  \      E
                32    33
                C    /  \
                    9    24
                   / \   F
                  2   7
                  Z   K
• Step 5:

42       65          79      120
D       /  \        /  \     E
      32    33    37    42
      C    /  \   U     L
          9    24
         / \   F
        2   7
        Z   K
• Final tree, after n = 8 steps:

          306
         /   \
      120     186
       E     /    \
           79      107
          /  \    /   \
        37   42  42    65
        U    L   D    /  \
                    32    33
                    C    /  \
                        9    24
                       / \   F
                      2   7
                      Z   K
Algorithm Huffman(X):
    Input: String X of length n.
    Output: Coding tree for X.

    Compute frequency f(c) of each character c in X.
    Initialize a priority queue Q.
    for each distinct character c in X do
        Create a single-node tree T storing c.
        Insert T into Q with key f(c).
    while Q.size() > 1 do
        f1 ← Q.minKey()
        T1 ← Q.removeMinElement()
        f2 ← Q.minKey()
        T2 ← Q.removeMinElement()
        Create a new tree T with left subtree T1 and right subtree T2.
        Insert T into Q with key f1 + f2.
    return Q.removeMinElement() // return tree
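
One possible Java rendering of this pseudocode is sketched below. It is not the author's implementation: the names HuffmanTree, Node and build are hypothetical, java.util.PriorityQueue stands in for the queue Q, and each node stores its own weight so that separate minKey() calls are not needed. An empty input is assumed to yield null.

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

class HuffmanTree {
    // A tree node: external nodes hold a symbol, internal nodes hold two subtrees.
    static final class Node {
        final int weight;
        final char symbol;       // meaningful only for external nodes
        final Node left, right;

        Node(int weight, char symbol, Node left, Node right) {
            this.weight = weight;
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Builds the coding tree for the text; returns null for an empty string.
    static Node build(String text) {
        // Compute the frequency f(c) of each character c.
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        // Insert one single-node tree per distinct character, keyed on f(c).
        PriorityQueue<Node> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getValue(), e.getKey(), null, null));
        }
        // Repeatedly merge the two lowest-weight trees.
        while (queue.size() > 1) {
            Node t1 = queue.poll();
            Node t2 = queue.poll();
            queue.add(new Node(t1.weight + t2.weight, '\0', t1, t2));
        }
        return queue.poll();
    }

    public static void main(String[] args) {
        Node root = build("AAAABBC");
        System.out.println(root.weight);        // 7
        System.out.println(root.right.symbol);  // A
    }
}
```

In the usage example the weights 4, 2 and 1 are all distinct, so the shape of the tree does not depend on how the queue breaks ties.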
Decoding
• To decode a bit stream (from the leftmost bit), start at the root node of the Tree:
    • move to the left child if the bit is a “0”.
    • move to the right child if the bit is a “1”.
• When an external node is reached, the character at the node is sent to the decoded string.
• The next bit is then decoded from the root of the tree.

            306
          0/   \1
        120     186
         E    0/   \1
            79      107
          0/  \1  0/   \1
         37   42  42    65
         U    L   D   0/  \1
                     32    33
                     C   0/  \1
                         9    24
                       0/ \1  F
                       2   7
                       Z   K

Decode:
1011001110111101
L 1001110111101
L U 1110111101
L U C 111101
L U C K
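
The decoding walk can be sketched in Java as follows. To stay self-contained, the sketch rebuilds the tree from the character codes of this example rather than from frequencies; the names HuffmanDecoder, Node, treeFromCodes and decode are hypothetical, and the input is assumed to be a well-formed bit string for a complete code set.

```java
class HuffmanDecoder {
    static final class Node {
        Node left, right;  // children reached by bit 0 and bit 1
        char symbol;       // set only on external nodes

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Rebuilds the coding tree by following each code's path from the root.
    static Node treeFromCodes(char[] symbols, String[] codes) {
        Node root = new Node();
        for (int i = 0; i < symbols.length; i++) {
            Node node = root;
            for (char bit : codes[i].toCharArray()) {
                if (bit == '0') {
                    if (node.left == null) node.left = new Node();
                    node = node.left;
                } else {
                    if (node.right == null) node.right = new Node();
                    node = node.right;
                }
            }
            node.symbol = symbols[i];
        }
        return root;
    }

    // Walks left on 0 and right on 1; emits a character at each external
    // node and restarts from the root.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node node = root;
        for (char bit : bits.toCharArray()) {
            node = (bit == '0') ? node.left : node.right;
            if (node.isLeaf()) {
                out.append(node.symbol);
                node = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] symbols = {'C', 'D', 'E', 'F', 'K', 'L', 'U', 'Z'};
        String[] codes = {"1110", "110", "0", "11111", "111101", "101", "100", "111100"};
        Node root = treeFromCodes(symbols, codes);
        System.out.println(decode(root, "1011001110111101"));  // LUCK
    }
}
```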
Encoding
• Create a lookup table storing the binary code corresponding to the path to each letter.
• If encoding ASCII text, a 128-element array would suffice.

String[] encoder = new String[128];
encoder['C'] = "1110";

Character  Frequency  Code    # bits
C          32         1110    4
D          42         110     3
E          120        0       1
F          24         11111   5
K          7          111101  6
L          42         101     3
U          37         100     3
Z          2          111100  6

Encode:
DEED
110EED
1100ED
11000D
11000110

• ASCII representation would require 32 bits.
• Huffman encoding requires 8 bits.
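
A minimal sketch of table-driven encoding, using the 128-element array suggested above. The class HuffmanEncoder and the methods buildTable and encode are hypothetical names; the codes are the ones from the table in this example, and throwing on characters without a code is an assumed design choice.

```java
class HuffmanEncoder {
    // Fills a 128-element lookup table indexed by ASCII character value.
    static String[] buildTable() {
        String[] encoder = new String[128];
        encoder['C'] = "1110";
        encoder['D'] = "110";
        encoder['E'] = "0";
        encoder['F'] = "11111";
        encoder['K'] = "111101";
        encoder['L'] = "101";
        encoder['U'] = "100";
        encoder['Z'] = "111100";
        return encoder;
    }

    // Replaces each character with its code; rejects characters that have none.
    static String encode(String text, String[] encoder) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            String code = (c < encoder.length) ? encoder[c] : null;
            if (code == null) {
                throw new IllegalArgumentException("No code for character: " + c);
            }
            bits.append(code);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        String bits = encode("DEED", buildTable());
        System.out.println(bits);           // 11000110
        System.out.println(bits.length());  // 8
    }
}
```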
Analysis
• Define $f_i$ = frequency of letter $l_i$, i = 1, … , n.
• Define $c_i$ = cost for each letter $l_i$ (number of bits).
• Expected cost per character:

\[
\mathrm{ECPC} = \frac{\sum_{i=1}^{n} c_i f_i}{\sum_{i=1}^{n} f_i} \quad \text{bits/character.}
\]

• Actual message length, ML = ECPC ⋅ N bits, where N is the total number of characters in the message.

Character  Frequency  Code    # bits
C          32         1110    4
D          42         110     3
E          120        0       1
F          24         11111   5
K          7          111101  6
L          42         101     3
U          37         100     3
Z          2          111100  6

\[
\mathrm{ECPC} = \frac{4 \cdot 32 + 3 \cdot 42 + 1 \cdot 120 + 5 \cdot 24 + 6 \cdot 7 + 3 \cdot 42 + 3 \cdot 37 + 6 \cdot 2}{32 + 42 + 120 + 24 + 7 + 42 + 37 + 2} = \frac{785}{306} \approx 2.57 \ \text{bits/character}
\]

A fixed-length encoding on 8 characters would require 3 bits per character, with an ML of 3 ⋅ 306 = 918 bits, compared with 785 bits for the Huffman code.
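
The expected cost can be recomputed with a short Java sketch. ExpectedCost and expectedCostPerCharacter are hypothetical names; the arrays hold the frequencies and code lengths of C, D, E, F, K, L, U and Z from the table.

```java
import java.util.Locale;

class ExpectedCost {
    // ECPC = sum(c_i * f_i) / sum(f_i), in bits per character.
    static double expectedCostPerCharacter(int[] frequencies, int[] codeLengths) {
        long weightedBits = 0;
        long totalCharacters = 0;
        for (int i = 0; i < frequencies.length; i++) {
            weightedBits += (long) frequencies[i] * codeLengths[i];
            totalCharacters += frequencies[i];
        }
        return (double) weightedBits / totalCharacters;
    }

    public static void main(String[] args) {
        int[] frequencies = {32, 42, 120, 24, 7, 42, 37, 2};
        int[] codeLengths = {4, 3, 1, 5, 6, 3, 3, 6};
        double ecpc = expectedCostPerCharacter(frequencies, codeLengths);
        System.out.println(String.format(Locale.ROOT, "%.2f bits/character", ecpc));  // 2.57 bits/character
    }
}
```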
Summary
• Huffman codes are variable length and are based on the observed frequency of characters.
• No Huffman code for a character in the set is the prefix of the code for another character.
• Huffman Coding achieves its best space savings when the variation in the frequencies of the letters is large.
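
The prefix property can be verified for any code set with a brute-force check such as the sketch below. PrefixCheck and isPrefixFree are hypothetical names, and the quadratic comparison is chosen for clarity rather than speed.

```java
class PrefixCheck {
    // Returns true when no code is a prefix of a different entry in the set.
    static boolean isPrefixFree(String[] codes) {
        for (int i = 0; i < codes.length; i++) {
            for (int j = 0; j < codes.length; j++) {
                if (i != j && codes[j].startsWith(codes[i])) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] codes = {"1110", "110", "0", "11111", "111101", "101", "100", "111100"};
        System.out.println(isPrefixFree(codes));                     // true
        System.out.println(isPrefixFree(new String[] {"0", "01"}));  // false
    }
}
```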