Data Compression Unit-1


Data Compression

Introduction:
A compression algorithm (or compression technique) takes an input X and
generates a representation Xc that requires fewer bits; a companion
reconstruction algorithm operates on the compressed representation Xc to
generate the reconstruction Y.
Data Compression
Types or Classes of Data Compression:
Based on the requirements of reconstruction, data
compression schemes can be divided into two broad
classes:
• Lossless Compression
– lossless compression schemes, in which Y is identical to X,
• Lossy Compression
– lossy compression schemes, which generally provide
much higher compression than lossless compression
but allow Y to be different from X.
Data Compression
Measure of Performance:
A compression algorithm can be evaluated in a
number of different ways:
– The memory required to implement the algorithm.
– How fast the algorithm performs on a given machine.
– The amount of compression.
– How closely the reconstruction resembles the
original.
Data Compression
Measure of Performance:
• Compression Ratio:
– the ratio of the number of bits required to represent the data
before compression to the number of bits required to
represent the data after compression.

Compression Ratio = (number of bits before compression) / (number of bits after compression)   (see the sketch after this list)

• Rate:
– Another way of reporting compression performance is to
provide the average number of bits required to represent a
single sample.
• Distortion:
– The difference between the original and the reconstruction is
often called the distortion.
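As a quick sketch of how the first two measures are computed (the function names and the 256x256 image figures below are only an illustrative example, not taken from this unit):

def compression_ratio(original_bits, compressed_bits):
    # Ratio of bits needed before compression to bits needed after compression.
    return original_bits / compressed_bits

def rate(compressed_bits, num_samples):
    # Average number of bits needed to represent a single sample after compression.
    return compressed_bits / num_samples

# Example: a 256x256 greyscale image stored at 8 bits per pixel (65,536 bytes)
# that compresses down to 16,384 bytes.
original_bits = 256 * 256 * 8
compressed_bits = 16384 * 8
print(compression_ratio(original_bits, compressed_bits))  # 4.0 -> a ratio of 4:1
print(rate(compressed_bits, 256 * 256))                   # 2.0 -> 2 bits per pixel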
Data Compression
Modeling and Coding:
The development of data compression algorithms
for a variety of data can be divided into two
phases.

• Modeling
– In this phase we try to extract information about any redundancy that exists
in the data and describe the redundancy in the form of a model.

• Coding
– A description of the model and a “description” of how the data differ from
the model are encoded, generally using a binary alphabet.
(The difference between the data and the model is often referred to as the
residual.)
Data Compression
Information Theory:
• Information theory is used in the development of lossless compression
techniques.
• Self-Information i(A):
– Suppose we have an event A, which is a set of outcomes of some random
experiment. If P(A) is the probability that the event A will occur, then the
self-information associated with A is given by
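i(A) = log_b [1/P(A)] = -log_b P(A)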
Data Compression
Information Theory:
• The self-information associated with the occurrence of both events A and B is
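i(AB) = -log_b P(AB)
– If A and B are independent events, then P(AB) = P(A)P(B), and so i(AB) = i(A) + i(B).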

• The unit of information depends on the base of the logarithm:


Log base 2 : bits
Log base e : nats
Log base 10 : hartleys
Data Compression
Information Theory:
• Entropy and Average Length:
– The average self-information associated with an experiment is called its
entropy (H). Entropy can be interpreted as a measure of the average number of
binary symbols needed to code the output of the source; its unit is bits/symbol.
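H = Σ P(A_i) i(A_i) = -Σ P(A_i) log_b P(A_i)
where the sum runs over the independent outcomes A_i of the experiment.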
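As a small illustration (Python; the function name entropy is just illustrative), the entropy of a simple probability model can be computed directly from the definition above. For the four-letter source used later in this unit, with P(a1) = 1/2, P(a2) = 1/4, P(a3) = P(a4) = 1/8, it works out to 1.75 bits/symbol:

import math

def entropy(probabilities, base=2):
    # Average self-information of the source; bits/symbol when base = 2.
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75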
Data Compression
Models:
To develop techniques that manipulate data using
mathematical operations, we need to have a
mathematical model for the data.
• Physical Models:
• Probability Models:
• Markov Models
• Composite Source Models
Data Compression
Models:
• Physical Models:
– If we know something about the physics of the data generation
process, we can use that information to construct a model.
• In speech-related applications, for example, knowledge about the physics of
speech production can be used to construct a mathematical model for the
sampled speech process.
• Sampled speech can then be encoded using this model.

– Models for certain telemetry data can also be obtained through knowledge of
the underlying process.
• If residential electrical meter readings at hourly intervals were to be
coded, knowledge about the living habits of the populace could be used
to determine when electricity usage would be high and when the usage
would be low.
Data Compression
Models:
• Probability Models:
– The simplest statistical model for the source is to
assume that each letter that is generated by the
source is independent of each other letter, and each
occurs with the same probability.
• For a source that generates letters from an alphabet A = {a1, a2, …, aM}, we
can have a probability model P = {P(a1), P(a2), …, P(aM)}.
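As a brief sketch of how such a probability model might be estimated in practice (illustrative code, not part of the original notes), the letter probabilities can be taken as relative frequencies in the data; the simplest "ignorance model" would instead assign each of the M letters the same probability 1/M:

from collections import Counter

def probability_model(data):
    # Estimate P(letter) for each letter by its relative frequency in the data.
    counts = Counter(data)
    total = len(data)
    return {letter: n / total for letter, n in counts.items()}

print(probability_model("abracadabra"))
# {'a': 0.4545..., 'b': 0.1818..., 'r': 0.1818..., 'c': 0.0909..., 'd': 0.0909...}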
Data Compression
Models (cont…):
• Markov Models
– One of the most popular ways of representing dependence in
the data is through the use of Markov models, named after the
Russian mathematician Andrei Andrevich Markov (1856–
1922).
– For models used in lossless compression, we use a specific type
of Markov process called a discrete time Markov chain.
– Let {Xn} be a sequence of observations. This sequence is said to
follow a kth-order Markov model if
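P(x_n | x_(n-1), …, x_(n-k)) = P(x_n | x_(n-1), …, x_(n-k), x_(n-k-1), …)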

– In other words, knowledge of the past k symbols is equivalent to knowledge of
the entire past history of the process.
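A minimal sketch of a first-order (k = 1) discrete time Markov chain estimated from data, assuming the conditional probabilities are taken as relative frequencies of adjacent symbol pairs (the function name is illustrative):

from collections import Counter, defaultdict

def first_order_markov(data):
    # Estimate P(next symbol | current symbol) from adjacent pairs in the data.
    pair_counts = defaultdict(Counter)
    for current, following in zip(data, data[1:]):
        pair_counts[current][following] += 1
    return {c: {s: n / sum(counts.values()) for s, n in counts.items()}
            for c, counts in pair_counts.items()}

model = first_order_markov("abababcab")
print(model["a"])  # {'b': 1.0}                     every 'a' is followed by 'b'
print(model["b"])  # {'a': 0.666..., 'c': 0.333...}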
Data Compression
Models (cont…) :
• Composite Source Models
– In many applications, it is not easy to use a single
model to describe the source.
– In such cases, we can define a composite source,
which can be viewed as a combination or composition
of several sources, with only one source being active
at any given time.
Data Compression
Coding:
• Coding is the assignment of binary sequences to elements of an alphabet. The
set of binary sequences is called a code.
• The individual members of the set are called codewords.
• An alphabet is a collection of symbols called letters.
• The 7-bit ASCII code for ‘a’ is 1100001 and for ‘A’ is 1000001.
• Because every codeword uses the same number of bits, such a code is called a
“fixed-length code”.
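A two-line check of the fixed-length ASCII codewords mentioned above (standard 7-bit ASCII):

print(format(ord('a'), '07b'))  # 1100001
print(format(ord('A'), '07b'))  # 1000001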
Data Compression
Coding:
• Prefix Codes
– A prefix code is one in which no codeword is a prefix of any other codeword;
such a code can be decoded without looking ahead.
• Dangling Suffix
– When one codeword is a prefix of another, the part of the longer codeword
left over after removing the shorter one is called the dangling suffix (e.g.,
for codewords 0 and 011, the dangling suffix is 11).
Data Compression
Coding:
• Uniquely Decodable Codes:
– The average length of the code is not the only
important point in designing a “good” code.
– Suppose our source alphabet consists of four letters
a1, a2, a3, and a4, with probabilities P(a1) = 1/2 ,
P(a2) = 1/4 , and P(a3) = P(a4) = 1/8 .
– The entropy for this source is 1.75 bits/symbol.
– The average length l for each code is given by
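l = Σ P(a_i) n(a_i)
where n(a_i) is the number of bits in the codeword assigned to the letter a_i.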
Data Compression
Coding:
• Uniquely Decodable Codes:
– Four different codes for this four-letter alphabet, and the average length
(l) of each, are compared below.
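The code assignments themselves are not legible in this copy; the table below is the classic textbook set of four codes for these probabilities, reproduced for reference:

Letter   Probability   Code 1   Code 2   Code 3   Code 4
a1       0.5           0        0        0        0
a2       0.25          0        1        10       01
a3       0.125         1        00       110      011
a4       0.125         10       11       111      0111
Average length (l):    1.125    1.25     1.75     1.875

– Code 1 assigns the same codeword to a1 and a2, Code 2 is not uniquely
decodable, Code 3 is a prefix code whose average length equals the entropy
(1.75 bits/symbol), and Code 4 is uniquely decodable but not a prefix code.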
Data Compression
Test for UDC:
• Construct a list of all the codewords.
• Examine all pairs of codewords to see if any
codeword is a prefix of another codeword.
• Whenever you find such a pair, add the dangling
suffix to the list unless you have added the same
dangling suffix to the list in a previous iteration.
• Now repeat the procedure using this larger list.
• Continue in this fashion until one of the following
two things happens:
1. You get a dangling suffix that is a codeword. (Not UDC)
2. There are no more unique dangling suffixes. (UDC)
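A sketch of this test in Python (a Sardinas-Patterson style check; the function name and the driver examples are illustrative, with the example codes taken from the table earlier in this unit):

def is_uniquely_decodable(codewords):
    # Test for unique decodability by repeatedly generating dangling suffixes.
    codewords = set(codewords)

    def dangling(a, b):
        # If a is a proper prefix of b, return the dangling suffix of b, else None.
        if len(a) < len(b) and b.startswith(a):
            return b[len(a):]
        return None

    # Step 1: dangling suffixes from every pair of codewords.
    pending = {dangling(a, b) for a in codewords for b in codewords if a != b}
    pending.discard(None)

    seen = set()
    while pending:
        s = pending.pop()
        if s in codewords:
            return False  # a dangling suffix is itself a codeword -> not UDC
        if s in seen:
            continue      # this dangling suffix was already examined
        seen.add(s)
        # Compare the new dangling suffix with every codeword, in both directions.
        for c in codewords:
            for d in (dangling(s, c), dangling(c, s)):
                if d is not None:
                    pending.add(d)
    return True           # no new dangling suffixes -> UDC

print(is_uniquely_decodable(["0", "10", "110", "111"]))   # True  (Code 3)
print(is_uniquely_decodable(["0", "01", "011", "0111"]))  # True  (Code 4)
print(is_uniquely_decodable(["0", "1", "00", "11"]))      # False (Code 2)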
