Huffman Coding Technique


Huffman coding, developed by David A. Huffman, is an entropy
encoding algorithm used for lossless data compression; the scheme
is independent of the specific characteristics of the medium. The
term refers to the use of a variable-length code table for encoding
a source symbol (such as a character in a file), where the
variable-length code table has been derived in a particular way
based on the estimated probability of occurrence for each possible
value of the source symbol. The codes generated using this
technique are called Huffman codes. These codes are prefix codes,
and they are optimal for a given model.

Fixed-Length versus Variable-Length Codes
In a fixed-length code each codeword has the same length, whereas
in a variable-length code the codewords may have different lengths.
Suppose we want to store messages made up of the characters a, b,
c, d, e, f, with frequencies (in thousands) given in the following
table; the codeword lengths shown are the ones used in the
computation below:

Character                        a    b    c    d    e    f
Frequency (in thousands)        45   13   12   16    9    5
Fixed-length codeword (bits)     3    3    3    3    3    3
Variable-length codeword (bits)  1    3    3    3    4    4
The fixed-length code requires
(45 + 13 + 12 + 16 + 9 + 5) × 3 × 1000 = 300,000 bits to store the file,
whereas the variable-length code requires
(45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4) × 1000 = 224,000 bits to store
the data.
This example clearly shows that a variable-length code can be
considerably more economical than a fixed-length code.
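
As a quick check of this arithmetic, here is a minimal Python sketch;
the frequencies and codeword lengths are taken directly from the
computation above:

```python
# Frequencies (in thousands) and codeword lengths from the example above.
freqs  = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
varlen = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}

fixed_bits = sum(freqs.values()) * 3 * 1000                       # 300,000 bits
var_bits   = sum(f * varlen[s] for s, f in freqs.items()) * 1000  # 224,000 bits
print(fixed_bits, var_bits)
```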



Prefix Codes (Prefix-free Codes)
Prefix codes are codes in which the bit string representing some
particular symbol is never a prefix of the bit string representing
any other symbol. Every message encoded by a prefix code is
uniquely decipherable, since no codeword is a prefix of any other
codeword.
For example, C = { a = 1, b = 110, c = 10, d = 111 } is not a prefix
code: the codeword for a (1) is a prefix of all the other codewords.
Here bad is encoded as 1101111, but 1101111 can be deciphered as bad,
and it can also be deciphered as acda or acad.
If instead we take the prefix code C = { a = 0, b = 110, c = 10, d = 111 },
then bad is encoded as 1100111, and the deciphered data is
1100111 = bad,
which is unique.
Thus the prefix property is what makes a code useful in practice:
it guarantees unambiguous decoding of the message.
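
A minimal Python sketch of prefix decoding, using the code table and
bit string from the example above: because no codeword is a prefix of
another, the first codeword that matches the buffered bits is always
the right one.

```python
def decode(bits, code):
    """Decode a bit string with a prefix code given as {symbol: codeword}."""
    inv = {cw: sym for sym, cw in code.items()}   # codeword -> symbol
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inv:            # prefix property: the first match is final
            out.append(inv[buf])
            buf = ""
    if buf:
        raise ValueError("input is not a valid encoding")
    return "".join(out)

print(decode("1100111", {"a": "0", "b": "110", "c": "10", "d": "111"}))  # bad
```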

The construction of Huffman codes is based on two observations about
optimal prefix codes:
1. In an optimal code, symbols with higher probability have shorter
codewords than symbols with lower probability.
2. In an optimal code, the two symbols that occur least frequently
have codewords of the same length (otherwise, truncating the longer
codeword to the length of the shorter one would still produce a
decodable code).

Algorithm of Huffman Coding:

1. Take the characters and their frequencies, and sort this list by
increasing frequency
2. All the characters are vertices of the tree.
3. Take the first 2 vertices from the list and make them children of
a vertex having the sum of their frequencies.
4. Insert the new vertex into the sorted list of vertices waiting to be
put into the tree.
5. If there are at least 2 vertices in the list, go to step 3.
6. Label each left branch with 0 and each right branch with 1.
7. Read the Huffman code for each character from the tree (a code
sketch implementing these steps follows the example below).
For example:
Consider the following characters in their increasing order of
frequencies:
e 1
d 1
c 3
a 5
b 9

Following the above algorithm steps produces a Huffman tree (figure
not reproduced) whose codes are:
b=0
a=11
c=101
d=1000
e=1001
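
A minimal Python sketch of the construction steps listed above, using
a heap for the sorted list of waiting vertices. Tie-breaking among
equal frequencies is arbitrary, so the exact codewords may differ from
the ones above, but the codeword lengths (1, 2, 3, 4, 4) come out the
same:

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code table from a {symbol: frequency} mapping."""
    # Heap entries are (frequency, tiebreak, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal vertex).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # two least-frequent vertices
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, n, (left, right)))  # merged parent
        n += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):           # internal vertex: 0 left, 1 right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"       # lone-symbol edge case
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"e": 1, "d": 1, "c": 3, "a": 5, "b": 9}))
```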

Flow chart (figure not reproduced).
Formalized description

Input.
Alphabet $A = \{a_1, a_2, \ldots, a_n\}$, which is the symbol alphabet
of size $n$.
Set $W = \{w_1, w_2, \ldots, w_n\}$, which is the set of the (positive)
symbol weights (usually proportional to probabilities), i.e.
$w_i = \operatorname{weight}(a_i)$, $1 \le i \le n$.

Output.
Code $C(W) = \{c_1, c_2, \ldots, c_n\}$, which is the set of (binary)
codewords, where $c_i$ is the codeword for $a_i$, $1 \le i \le n$.

Goal.
To find a code that minimizes the weighted path length
$L(C) = \sum_{i=1}^{n} w_i \, \operatorname{length}(c_i)$.
The average code length is
$\bar{L} = \sum_{i=1}^{n} p_i \, \operatorname{length}(c_i)$
and the entropy is
$H(A) = -\sum_{i=1}^{n} p_i \log_2 p_i$,
where $p_i$ is the probability (normalized weight $w_i$) of the
symbol $a_i$.

Efficiency of Huffman Codes
Redundancy ($r$) is the difference between the average length of a
code and the entropy: $r = \bar{L} - H(A)$.
For the five-symbol source of the image example below (entropy
$H = 2.122$ bits), a fixed-length code spends three bits per sample,
which gives a code redundancy of $3 - 2.122 = 0.878$ bits per symbol.
For a Huffman code, the redundancy is zero when the probabilities
are negative powers of two.
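
A short Python check of these numbers, using the probabilities
0.4, 0.2, 0.2, 0.1, 0.1 from the image example below (they reproduce
the 2.122-bit entropy quoted here):

```python
from math import log2

probs = [0.4, 0.2, 0.2, 0.1, 0.1]
H = -sum(p * log2(p) for p in probs)       # entropy: about 2.122 bits/symbol
print(f"H = {H:.3f} bits, fixed-length redundancy = {3 - H:.3f} bits")
```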

Minimum Variance Huffman Codes

When two or more symbols in a Huffman tree have the same probability,
different merge orders produce different Huffman codes.

Two code trees with the same symbol probabilities (figure not
reproduced).
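
A worked illustration with the probabilities $\{0.4, 0.2, 0.2, 0.1, 0.1\}$
(the same source as the image example below): one merge order yields
codeword lengths $(1, 2, 3, 4, 4)$, another yields $(2, 2, 2, 3, 3)$.
Both codes are optimal, since they have the same average length, but
their variances differ:

$\bar{L}_1 = 0.4(1) + 0.2(2) + 0.2(3) + 0.1(4) + 0.1(4) = 2.2$ bits/symbol
$\bar{L}_2 = 0.4(2) + 0.2(2) + 0.2(2) + 0.1(3) + 0.1(3) = 2.2$ bits/symbol
$\sigma_1^2 = \sum_i p_i (l_i - \bar{L})^2 = 1.36, \qquad \sigma_2^2 = 0.16$

The minimum variance code (the second one) is preferred when the
output bit rate needs to be as steady as possible, for example when
transmitting over a fixed-rate channel.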



Encoding of data:
Given the characters and their frequencies, run the algorithm to
generate a code, then write out the characters using that code.
Decoding of data:
Given the Huffman tree, figure out what each character is (possible
because of the prefix property).

ENCODING:

Algorithm:
1. Find the gray-level probabilities for the image by computing its
histogram.
2. Order the input probabilities (histogram magnitudes) from
smallest to largest.
3. Combine the smallest two by addition.
4. Go to step 2 until only two probabilities are left.
5. Working backward along the tree, generate the code by alternately
assigning 0 and 1.
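
A minimal Python sketch of steps 1-2 (the name `image` and the use of
NumPy are illustrative assumptions, not part of the original text):

```python
import numpy as np

def gray_level_probs(image):
    """Steps 1-2: histogram of gray levels, normalized and sorted ascending."""
    levels, counts = np.unique(image, return_counts=True)
    probs = counts / counts.sum()       # normalize by total pixel count
    order = np.argsort(probs)           # order from smallest to largest
    return list(zip(levels[order].tolist(), probs[order].tolist()))
```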

Coding procedure for an N-symbol source:
1. Source reduction
    a. List all probabilities in descending order.
    b. Merge the two symbols with the smallest probabilities into
    a new compound symbol.
    c. Repeat the two steps above N − 2 times.
2. Codeword assignment
    a. Start from the last (smallest) reduced source and work back
    to the original source.
    b. Each merging point corresponds to a node in the binary
    codeword tree.

Example:
Consider an image with 3 bits/pixel, giving 8 possible gray levels.
The image is of 10 rows by 10 columns.
Step 1: Find the histogram for the image and convert it into
probabilities by normalizing to the total number of pixels (here 100):
Gray level 1 has 20 pixels
Gray level 2 has 40 pixels
Gray level 3 has 20 pixels
Gray level 4 has 10 pixels
Gray level 5 has 10 pixels

Step 2: The probabilities are ordered:
a5 = 0.1
a4 = 0.1
a3 = 0.2
a1 = 0.2
a2 = 0.4
Step 3: Combine the smallest two by addition.
Step 4: Repeat steps 2 and 3: reorder (if necessary) and add the two
smallest probabilities, until only two values remain.
Step 5: The actual code assignment is made. Start on the right-hand
side of the tree and assign 0s and 1s.

The gray level represented by 1 bit, a2, is the most likely to occur
(40% of the time) and thus carries the least information in the
information-theoretic sense.
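
Putting the earlier sketches together on this example (the pixel
counts double as frequencies): depending on how ties among equal
probabilities are broken, `huffman_codes` may return the single-bit
code for a2 described above or the minimum variance variant with
lengths (2, 2, 2, 3, 3); both average 2.2 bits per pixel, versus
3 bits for the fixed-length code.

```python
counts = {"a1": 20, "a2": 40, "a3": 20, "a4": 10, "a5": 10}  # pixels per level
codes = huffman_codes(counts)                 # construction sketch above
avg = sum(counts[s] * len(codes[s]) for s in counts) / sum(counts.values())
print(codes, f"average length = {avg} bits/pixel")           # 2.2
```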

DECODING:
The process of decompression is simply a matter of translating the
stream of prefix codes to individual byte values, usually by
traversing the Huffman tree node by node as each bit is read from
the input stream (reaching a leaf node terminates the search for
that particular byte value). For this to work, however, the Huffman
tree must somehow be reconstructed by the decoder; otherwise, the
information needed to reconstruct the tree must be sent a priori.
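
A minimal Python sketch of this bit-by-bit traversal, assuming the
same (left, right)-tuple tree representation used in the construction
sketch earlier; the example tree encodes the prefix code
C = { a = 0, b = 110, c = 10, d = 111 } from the prefix-code section:

```python
def decode_with_tree(bits, root):
    """Walk the tree one bit at a time; a leaf emits a symbol and resets."""
    out, node = [], root
    for b in bits:
        node = node[int(b)]                 # 0 -> left child, 1 -> right child
        if not isinstance(node, tuple):     # leaf reached
            out.append(node)
            node = root
    return "".join(out)

root = ("a", ("c", ("b", "d")))             # tree for {a:0, c:10, b:110, d:111}
print(decode_with_tree("1100111", root))    # -> bad
```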


MATLAB CODE:

(The original MATLAB listing and its output screenshot are not
reproduced here.)
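
Since the original listing is not reproduced, here is a short
end-to-end stand-in in Python (not the original program) that ties
the earlier sketches together: count frequencies, build the code,
encode, and decode. The sample input string is hypothetical.

```python
text = "abracadabra"                        # hypothetical sample input
freqs = {}
for ch in text:                             # frequency count
    freqs[ch] = freqs.get(ch, 0) + 1

codes = huffman_codes(freqs)                # construction sketch above
encoded = "".join(codes[ch] for ch in text)
assert decode(encoded, codes) == text       # table decoder from earlier
print(codes)
print(encoded)
```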

Advantages:
The algorithm is easy to implement.
It produces a lossless compression of images.
It reduces the size of the data by 20%-90% in general.
Huffman codes can express the most common source symbols using
shorter strings of bits than are used for less common source
symbols.
The running time of Huffman's method is fairly efficient.
Huffman coding is the most efficient symbol-by-symbol compression
method: no other mapping of individual source symbols to unique
strings of bits produces a smaller average output size when the
actual symbol frequencies agree with those used to create the code.
The variance of the codeword lengths can be minimized, which is
generally beneficial.


Disadvantages:
Although Huffman's original algorithm is optimal for
symbol-by-symbol coding (i.e., a stream of unrelated symbols) with a
known input probability distribution, it is not optimal when the
symbol-by-symbol restriction is dropped, or when the probability
mass functions are unknown, not identically distributed, or not
independent (e.g., "cat" is more common than "cta"). Other methods
such as arithmetic coding and LZW coding often have better
compression capability.
If no characters occur more frequently than others, there is no
advantage over a fixed-length code such as ASCII.

Applications
Huffman coding is a technique used to compress files for
transmission.
It uses statistical coding: more frequently used symbols have
shorter codewords.
It works well for text and fax transmissions.
It is an application that uses several data structures.
Both the .mp3 and .jpg file formats use Huffman coding at one stage
of the compression.
An alternative method that achieves higher compression but is slower
is patented by IBM, which makes Huffman codes attractive.
