Huffman Coding

Lawrence M. Brown

Huffman Coding†

• Huffman Coding
• Variable Length Encoding
• Building a Tree
• Decoding
• Encoding

†Adapted from: http://www.apl.jhu.edu/~caj/

25 September, 1999

Huffman Coding
• Huffman Coding is a variable-length prefix encoding algorithm for
compression of character streams.

• Codes are assigned to characters such that the length of the code depends on the relative frequency of the corresponding character.

Examples:
• File compression:
  • JPEG images.
  • MPEG movies.
• Transmission of data over band-limited channels:
  • Modem data compression.

Letter  Frequency    Letter  Frequency
A       77           N       67
B       17           O       67
C       32           P       20
D       42           Q       5
E       120          R       59
F       24           S       67
G       17           T       85
H       50           U       37
I       76           V       12
J       4            W       22
K       7            X       4
L       42           Y       22
M       24           Z       2

Frequency of occurrence per 1000 letters.¹

¹Shaffer, Clifford A., A Practical Introduction to Data Structures and Algorithm Analysis, Java Edition, Prentice Hall (1998).

Data Representation
Bits and Bytes

• Digital computers store data in binary (base-2) format.

• A binary digit (bit) is either 0 or 1.

• A byte is an 8-bit number and is typically the smallest unit of binary data that a computer stores and addresses.

$$
\begin{aligned}
01001011_2 &= 0 \times 2^7 + 1 \times 2^6 + 0 \times 2^5 + 0 \times 2^4 + 1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 \\
           &= 1 \times 64 + 1 \times 8 + 1 \times 2 + 1 \times 1 \\
           &= 64_{10} + 8_{10} + 2_{10} + 1_{10} \\
           &= 75_{10}.
\end{aligned}
$$

Longer words (16-bit, 32-bit, 64-bit) are constructed from 8-bit bytes.
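
The positional expansion above can be checked with a few lines of Java. This is only an illustrative sketch; the class and method names (BinaryDemo, toDecimal) are made up for this example, while Integer.parseInt with a radix argument is the standard library call.

```java
final class BinaryDemo {
    /** Expands a string of '0'/'1' characters by positional weight, most significant bit first. */
    static int toDecimal(String bits) {
        int value = 0;
        for (char bit : bits.toCharArray()) {
            value = value * 2 + (bit - '0'); // shift one binary place left, then add the new bit
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(toDecimal("01001011"));           // 75
        System.out.println(Integer.parseInt("01001011", 2)); // 75, using the library call
    }
}
```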


Unicode and ASCII


• Unicode is an international standard that defines a universal character set (16-bit unsigned integers).

• Unicode characters range from 0 to 65,535 (\u0000 to \uFFFF) and incorporate all languages (English, Russian, Asian, etc.).

• Java stores characters (char) as Unicode.

• The standard set of ASCII characters still ranges from 32 to 127 (\u0020 to \u007F in Unicode).

• ASCII characters represent the lowest 7 bits of the Unicode set, with the upper 9 bits set to zero.

ASCII Character Set
(character code = row label + column number)

        0    1    2    3    4    5    6    7
  0   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL
  8   BS   HT   NL   VT   NP   CR   SO   SI
 16   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB
 24   CAN  EM   SUB  ESC  FS   GS   RS   US
 32   SP   !    "    #    $    %    &    '
 40   (    )    *    +    ,    -    .    /
 48   0    1    2    3    4    5    6    7
 56   8    9    :    ;    <    =    >    ?
 64   @    A    B    C    D    E    F    G
 72   H    I    J    K    L    M    N    O
 80   P    Q    R    S    T    U    V    W
 88   X    Y    Z    [    \    ]    ^    _
 96   `    a    b    c    d    e    f    g
104   h    i    j    k    l    m    n    o
112   p    q    r    s    t    u    v    w
120   x    y    z    {    |    }    ~    DEL
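
A short Java sketch can make the relationship between char values and ASCII concrete. The class and method names (CharCodeDemo, isAscii) are illustrative only; Character.SIZE is the standard constant giving the width of a char in bits.

```java
final class CharCodeDemo {
    /** Returns true when the upper 9 bits of the 16-bit char are zero, i.e. the value fits in 7 bits. */
    static boolean isAscii(char c) {
        return (c & 0xFF80) == 0; // 0xFF80 = 1111 1111 1000 0000 selects the upper 9 bits
    }

    public static void main(String[] args) {
        System.out.println((int) 'A');          // 65
        System.out.println(Character.SIZE);     // 16
        System.out.println(isAscii('~'));       // true
        System.out.println(isAscii('\u0416'));  // false (a Cyrillic letter)
    }
}
```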


Variable-length Encoding
• Unicode and ASCII are fixed-length encoding schemes. All characters
require the same amount of storage (16 bits and 8 bits, respectively).

• Huffman coding is a variable-length encoding scheme. The number of bits required to store a coded character varies according to the relative frequency or weight of the character.

• Significant space savings are achieved for frequently used characters (requiring only one, two or three bits).

• Little space is saved for infrequent characters.

Letter Frequency Huffman Code


E 120
I 10
2


Huffman Coding Tree

• A Huffman Coding Tree is built from the observed frequencies of characters in a document.

• The document is scanned and the number of occurrences of each character is recorded.

• Next, a Binary Tree is built in which the external nodes store the character and the corresponding character frequency observed in the document.

Often, pre-scanning a document and generating a custom Huffman Coding Tree is impractical. In that case, typical frequencies are used in place of the specific frequencies from a particular document.
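
The scanning pass described above might look like the following in Java. This is a minimal sketch, not the slides' own code: FrequencyDemo and countFrequencies are hypothetical names, and a TreeMap is used only so that the printed result appears in character order.

```java
import java.util.Map;
import java.util.TreeMap;

final class FrequencyDemo {
    /** Scans the text once and counts how often each character occurs. */
    static Map<Character, Integer> countFrequencies(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum); // start at 1, otherwise add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countFrequencies("DEED")); // {D=2, E=2}
    }
}
```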


Building a Huffman Coding Tree


• Consider the observed frequency of characters in a string that requires
encoding:

Character   C    D    E    F    K    L    U    Z
Frequency   32   42   120  24   7    42   37   2

• The first step is to construct a Priority Queue and insert each frequency-character (key-element) pair into the queue.

• Step 1:

  2    7    24   32   37   42   42   120
  Z    K    F    C    U    L    D    E

Sorted, sequence-based, priority queue.
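
As a rough illustration of Step 1, the sketch below inserts the frequency-character pairs into java.util.PriorityQueue and removes them in key order. Note the substitution: the standard PriorityQueue is heap-based rather than the sorted sequence shown on the slide, and it does not promise any particular order for equal keys (L and D both have key 42), so only the keys are printed. The names QueueDemo, Item and drainKeys are invented for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class QueueDemo {
    /** A frequency-character (key-element) pair. */
    static final class Item {
        final int frequency;
        final char symbol;

        Item(int frequency, char symbol) {
            this.frequency = frequency;
            this.symbol = symbol;
        }
    }

    /** Inserts every pair, then removes them one by one and records the keys in removal order. */
    static List<Integer> drainKeys(char[] symbols, int[] frequencies) {
        PriorityQueue<Item> queue =
                new PriorityQueue<>(Comparator.comparingInt((Item item) -> item.frequency));
        for (int i = 0; i < symbols.length; i++) {
            queue.add(new Item(frequencies[i], symbols[i]));
        }
        List<Integer> keys = new ArrayList<>();
        while (!queue.isEmpty()) {
            keys.add(queue.poll().frequency); // poll() always removes the smallest remaining key
        }
        return keys;
    }

    public static void main(String[] args) {
        char[] symbols = {'C', 'D', 'E', 'F', 'K', 'L', 'U', 'Z'};
        int[] frequencies = {32, 42, 120, 24, 7, 42, 37, 2};
        System.out.println(drainKeys(symbols, frequencies)); // [2, 7, 24, 32, 37, 42, 42, 120]
    }
}
```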


Building a Huffman Coding Tree


• In the second step, the two Items with the lowest key values are
removed from the priority queue.

• A new Binary Tree is created with the lowest-key Item as the left
external node, and the second lowest-key Item as the right external
node.
• The new Tree is then inserted back into the priority queue.

• Step 2:

     9      24   32   37   42   42   120
    / \     F    C    U    L    D    E
   2   7
   Z   K


Building a Huffman Coding Tree


• The process is continued until only one node (the Binary Tree) is left in
the priority queue.

• Step 3:

  32      33      37   42   42   120
  C      /  \     U    L    D    E
        9    24
       / \   F
      2   7
      Z   K

• Step 4:

  37   42   42      65       120
  U    L    D      /  \      E
                 32    33
                 C    /  \
                     9    24
                    / \   F
                   2   7
                   Z   K


Building a Huffman Coding Tree


• Step 5:

  42      65          79       120
  D      /  \        /  \      E
       32    33    37    42
       C    /  \   U     L
           9    24
          / \   F
         2   7
         Z   K


Building a Huffman Coding Tree


• Final tree, after n = 8 steps:

            306
           /   \
        120     186
         E     /   \
             79     107
            /  \    /  \
          37   42  42   65
          U    L   D   /  \
                     32    33
                     C    /  \
                         9    24
                        / \   F
                       2   7
                       Z   K


Building a Huffman Coding Tree


Algorithm Huffman( X ):
    Input: String X of length n.
    Output: Coding tree for X.

    Compute frequency f(c) of each character c in X.
    Initialize a priority queue Q.
    for each distinct character c in X do
        Create a single-node tree T storing c.
        Insert T into Q with key f(c).
    while Q.size() > 1 do
        f1 ← Q.minKey()
        T1 ← Q.removeMinElement()
        f2 ← Q.minKey()
        T2 ← Q.removeMinElement()
        Create a new tree T with left subtree T1 and right subtree T2.
        Insert T into Q with key f1 + f2.
    return Q.removeMinElement()    // return tree
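
The pseudocode above could be rendered in Java roughly as follows. This is a sketch under stated assumptions rather than a reference implementation: the names HuffmanTreeDemo, Node, buildTree and collectCodes are invented here, the frequencies are passed in directly instead of being computed from a string, and an insertion counter is added purely as a tie-breaker so that equal keys (L and D, both 42) come out in the order they were inserted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

final class HuffmanTreeDemo {
    /** Tree node: external nodes carry a symbol, internal nodes carry two subtrees. */
    static final class Node implements Comparable<Node> {
        final int weight;
        final char symbol;      // meaningful only for external nodes
        final Node left, right; // both null for external nodes
        final int order;        // insertion counter, used only to break ties (an assumption)

        Node(int weight, char symbol, Node left, Node right, int order) {
            this.weight = weight;
            this.symbol = symbol;
            this.left = left;
            this.right = right;
            this.order = order;
        }

        boolean isExternal() {
            return left == null && right == null;
        }

        @Override
        public int compareTo(Node other) {
            int byWeight = Integer.compare(weight, other.weight);
            return byWeight != 0 ? byWeight : Integer.compare(order, other.order);
        }
    }

    /** Builds the coding tree by repeatedly merging the two lowest-key trees. */
    static Node buildTree(char[] symbols, int[] frequencies) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        int order = 0;
        for (int i = 0; i < symbols.length; i++) {
            queue.add(new Node(frequencies[i], symbols[i], null, null, order++));
        }
        while (queue.size() > 1) {
            Node t1 = queue.poll(); // lowest key becomes the left subtree
            Node t2 = queue.poll(); // second lowest key becomes the right subtree
            queue.add(new Node(t1.weight + t2.weight, '\0', t1, t2, order++));
        }
        return queue.poll();
    }

    /** Walks the tree, appending 0 for each left branch and 1 for each right branch. */
    static void collectCodes(Node node, String path, Map<Character, String> codes) {
        if (node.isExternal()) {
            codes.put(node.symbol, path);
            return;
        }
        collectCodes(node.left, path + "0", codes);
        collectCodes(node.right, path + "1", codes);
    }

    public static void main(String[] args) {
        // Symbols listed in the order of the sorted queue from Step 1.
        char[] symbols = {'Z', 'K', 'F', 'C', 'U', 'L', 'D', 'E'};
        int[] frequencies = {2, 7, 24, 32, 37, 42, 42, 120};
        Node root = buildTree(symbols, frequencies);
        Map<Character, String> codes = new HashMap<>();
        collectCodes(root, "", codes);
        System.out.println(root.weight);    // 306
        System.out.println(codes.get('E')); // 0
        System.out.println(codes.get('K')); // 111101
    }
}
```

With a different tie-breaking rule, L and D may swap codes; the code lengths, and therefore the compressed size, stay the same.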

Decoding
• To decode a bit stream (from the leftmost bit), start at the root node of the Tree:
  • move to the left child if the bit is a “0”.
  • move to the right child if the bit is a “1”.
• When an external node is reached, the character at the node is sent to the decoded string.
• The next bit is then decoded from the root of the tree.

            306
         0 /   \ 1
        120     186
         E   0 /   \ 1
             79     107
          0 /  \1 0 /  \ 1
          37   42  42   65
          U    L   D 0 /  \ 1
                     32    33
                     C  0 /  \ 1
                         9    24
                      0 / \ 1 F
                       2   7
                       Z   K

Decode:

  1011001110111101
  L 1001110111101
  L U 1110111101
  L U C 111101
  L U C K
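
The decoding walk can be sketched in Java as below. To keep the example self-contained, the tree is rebuilt from the codes read off the tree for this example (0 = left, 1 = right) instead of from frequencies; that shortcut, and the names DecodeDemo, Node, treeFromCodes, exampleCodes and decode, are assumptions of this sketch. The decoder assumes a well-formed bit string and does no error handling.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DecodeDemo {
    static final class Node {
        Node left, right; // children reached by bit 0 and bit 1
        char symbol;      // set only on external nodes

        boolean isExternal() {
            return left == null && right == null;
        }
    }

    /** Rebuilds the coding tree from a symbol-to-code table (0 = left, 1 = right). */
    static Node treeFromCodes(Map<Character, String> codes) {
        Node root = new Node();
        for (Map.Entry<Character, String> entry : codes.entrySet()) {
            Node node = root;
            for (char bit : entry.getValue().toCharArray()) {
                if (bit == '0') {
                    if (node.left == null) node.left = new Node();
                    node = node.left;
                } else {
                    if (node.right == null) node.right = new Node();
                    node = node.right;
                }
            }
            node.symbol = entry.getKey();
        }
        return root;
    }

    /** Codes for the eight-letter example. */
    static Map<Character, String> exampleCodes() {
        Map<Character, String> codes = new LinkedHashMap<>();
        codes.put('C', "1110");
        codes.put('D', "110");
        codes.put('E', "0");
        codes.put('F', "11111");
        codes.put('K', "111101");
        codes.put('L', "101");
        codes.put('U', "100");
        codes.put('Z', "111100");
        return codes;
    }

    /** Follows one bit at a time from the root and emits a symbol at each external node. */
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node node = root;
        for (char bit : bits.toCharArray()) {
            node = (bit == '0') ? node.left : node.right;
            if (node.isExternal()) {
                out.append(node.symbol);
                node = root; // the next bit is decoded from the root again
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = treeFromCodes(exampleCodes());
        System.out.println(decode(root, "1011001110111101")); // LUCK
    }
}
```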

Encoding
• Create a lookup table storing the binary code corresponding to the path to each letter.
• If encoding ASCII text, a 128-element array would suffice.

    String[] encoder = new String[128];
    encoder['C'] = "1110";

Character   Frequency   Code     # bits
C           32          1110     4
D           42          110      3
E           120         0        1
F           24          11111    5
K           7           111101   6
L           42          101      3
U           37          100      3
Z           2           111100   6

Encode:

  DEED
  110EED
  1100ED
  11000D
  11000110

• An 8-bit ASCII representation of the four characters would require 32 bits.

• Huffman encoding requires 8 bits.
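
Extending the encoder array idea into a complete, runnable sketch might look like this. The class and method names (EncodeDemo, buildEncoder, encode) are illustrative; the codes are the ones from the table for this example.

```java
final class EncodeDemo {
    /** Fills a 128-element lookup table indexed by ASCII code; unused entries stay null. */
    static String[] buildEncoder() {
        String[] encoder = new String[128];
        encoder['C'] = "1110";
        encoder['D'] = "110";
        encoder['E'] = "0";
        encoder['F'] = "11111";
        encoder['K'] = "111101";
        encoder['L'] = "101";
        encoder['U'] = "100";
        encoder['Z'] = "111100";
        return encoder;
    }

    /** Replaces each character by its code and concatenates the results. */
    static String encode(String text, String[] encoder) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            String code = (c < encoder.length) ? encoder[c] : null;
            if (code == null) {
                throw new IllegalArgumentException("No code for character: " + c);
            }
            bits.append(code);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        String bits = encode("DEED", buildEncoder());
        System.out.println(bits);          // 11000110
        System.out.println(bits.length()); // 8
    }
}
```

A real encoder would pack these bits into bytes; a String of '0' and '1' characters is used here only to keep the output readable.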

Analysis
• Define $f_i$ = frequency of letter $l_i$, $i = 1, \ldots, n$.
• Define $c_i$ = cost for each letter $l_i$ (number of bits).
• Expected cost per character:

$$
\mathrm{ECPC} = \frac{\sum_{i=1}^{n} c_i f_i}{\sum_{i=1}^{n} f_i} \ \text{bits/character}.
$$

• Actual message length, $\mathrm{ML} = \mathrm{ECPC} \cdot N$ bits, where $N$ is the total number of characters in the message.

Character   Frequency   Code     # bits
C           32          1110     4
D           42          110      3
E           120         0        1
F           24          11111    5
K           7           111101   6
L           42          101      3
U           37          100      3
Z           2           111100   6

$$
\mathrm{ECPC} = \frac{4 \cdot 32 + 3 \cdot 42 + 1 \cdot 120 + 5 \cdot 24 + 6 \cdot 7 + 3 \cdot 42 + 3 \cdot 37 + 6 \cdot 2}{32 + 42 + 120 + 24 + 7 + 42 + 37 + 2}
= \frac{785}{306} \approx 2.57 \ \text{bits/character}
$$

A fixed-length encoding on 8 characters would require 3 bits per character, with an ML of 918 bits for these 306 characters, compared with 785 bits for the Huffman code.
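
The arithmetic can be reproduced with a small Java sketch. CostDemo and expectedCostPerCharacter are names chosen for this example; the frequencies and code lengths are those from the table.

```java
final class CostDemo {
    /** ECPC = sum(c_i * f_i) / sum(f_i), in bits per character. */
    static double expectedCostPerCharacter(int[] frequencies, int[] codeLengths) {
        long weightedBits = 0;
        long totalCharacters = 0;
        for (int i = 0; i < frequencies.length; i++) {
            weightedBits += (long) codeLengths[i] * frequencies[i];
            totalCharacters += frequencies[i];
        }
        return (double) weightedBits / totalCharacters;
    }

    public static void main(String[] args) {
        int[] frequencies = {32, 42, 120, 24, 7, 42, 37, 2}; // C D E F K L U Z
        int[] codeLengths = {4, 3, 1, 5, 6, 3, 3, 6};
        double ecpc = expectedCostPerCharacter(frequencies, codeLengths);
        System.out.println(Math.round(ecpc * 100) / 100.0); // 2.57
        System.out.println(Math.round(ecpc * 306));         // 785 bits for the Huffman code
        System.out.println(3 * 306);                        // 918 bits for a 3-bit fixed-length code
    }
}
```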


Summary
• Huffman codes are variable length and are based on the observed frequency of characters.

• No Huffman code for a character in the set is a prefix of the code for another character.

• Huffman Coding compression achieves the best space savings when the variation in the frequencies of the letters is large.
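
The prefix property can be verified mechanically. The following Java sketch (PrefixDemo and isPrefixFree are hypothetical names) compares every pair of codes; it is a simple quadratic check, which is adequate for a table of this size.

```java
final class PrefixDemo {
    /** Returns true when no code in the table is a prefix of a different code. */
    static boolean isPrefixFree(String[] codes) {
        for (int i = 0; i < codes.length; i++) {
            for (int j = 0; j < codes.length; j++) {
                if (i != j && codes[j].startsWith(codes[i])) {
                    return false; // codes[i] is a prefix of codes[j]
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] huffman = {"1110", "110", "0", "11111", "111101", "101", "100", "111100"};
        System.out.println(isPrefixFree(huffman));                  // true
        System.out.println(isPrefixFree(new String[] {"0", "01"})); // false
    }
}
```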
