Project 1
Huffman coding
TA : Sanketh Indarapu
1 Objective
Given a frequency distribution of symbols, the Huffman algorithm assigns a code to each symbol so as to
minimize the size of the message. The message is then encoded using this symbol-to-code mapping and
transmitted to the receiver. For the receiver to decode the encoded message, it is necessary for the receiver
to have some knowledge of the symbol-to-code mapping. This information is generally transmitted as the
header part of the encoded file. In this project, you will implement techniques to minimize the size of this
header.
2 Description
In this section, we informally describe the details involved in the implementation. The formal
specifications of the project are laid down later.
Before reading further, you are advised to first refer to the class notes for Huffman encoding. Under-
standing those portions is crucial to what follows.
Consider different ways of informing the receiver of the symbol-to-code mapping. Most common imple-
mentations transmit the frequency distribution of the symbols in the header part. The idea behind this
approach is that the receiver can then build the Huffman tree and thereby learn the codes for the symbols.
However, this approach needs additional guarantees to succeed. For a given frequency distribution, different
Huffman trees are possible - all of them are optimal in the sense that they yield the same total encoded
length, but the actual codes for the symbols (and, when ties occur, even the codeword lengths) differ. For
this approach to work, there must be a guarantee that the sender and the receiver build the same Huffman tree. This is generally
guaranteed in practice by specifying that both the algorithms break ties in the same way - that is, they
handle cases when there are equal contenders for choices in the same way.
Note that this frequency information can be large, because there is one count per symbol and each
count can itself be a large value. Nevertheless, this is what is used in practice because, in general, the
size of the data far exceeds the size of the header. In this project, however, we will try to optimize
this header size, mainly because it is instructive to do so.
is to send just the codeword lengths of the symbols over to the receiver. The receiver and the sender should
then agree on a common optimal tree to build from the codeword length information in order to determine the
actual codes. Both will then have the same codewords for all the symbols, thus achieving the original
purpose. Note that this approach differs from the one described in the previous subsection in agreeing on
the definition of a common optimal tree, whereas the previous approach agrees on a common algorithm to
build the optimal tree. A quick order estimate shows that this would cost about O(|A| ∗ log(max depth)),
that is, in our case, 256*8 bits. This is especially neat, because we would then be using exactly one byte for each
codeword length value (which can be anywhere between 1 and 255 in the worst case).
The approach of the previous paragraph therefore seems promising. The sender would send the receiver
the codeword lengths as a list of 256 values. Then both would construct the same optimal tree from these lengths (we
shall call it the canonical optimal tree from now on) and use this tree for the encoding and the decoding purposes.
Note that this seems to imply that the sender constructs two trees - one for finding the codeword lengths
themselves and one for the canonical optimal tree. However, this is not necessarily the case and the sender
could execute a minimal version of the Huffman algorithm that does not involve building the tree, but instead
relies on simpler data structures.
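To make the idea of deriving codes from lengths alone concrete, here is a minimal sketch in Java. It does not reproduce this project's definition of the canonical optimal tree, which is not restated here. It assumes the widely used canonical Huffman convention instead: shorter codewords come first, and symbols with equal lengths receive consecutive code values in increasing symbol order. The class name CanonicalCodesSketch and the method assignCodes are hypothetical names chosen for illustration.

```java
import java.math.BigInteger;

class CanonicalCodesSketch {
    // ASSUMPTION: standard canonical Huffman convention, not necessarily
    // the canonical optimal tree defined for this project.
    // lengths[s] is the codeword length of symbol s (0 = symbol absent).
    // Returns the codeword of each symbol as a string of '0'/'1' (null if absent).
    static String[] assignCodes(int[] lengths) {
        int maxLen = 0;
        for (int len : lengths) maxLen = Math.max(maxLen, len);

        // count[len] = number of symbols whose codeword has this length
        int[] count = new int[maxLen + 1];
        for (int len : lengths) if (len > 0) count[len]++;

        // next[len] = first code value for this length; BigInteger is used
        // because codeword lengths may be as large as 255 bits.
        BigInteger[] next = new BigInteger[maxLen + 1];
        BigInteger code = BigInteger.ZERO;
        for (int len = 1; len <= maxLen; len++) {
            code = code.add(BigInteger.valueOf(count[len - 1])).shiftLeft(1);
            next[len] = code;
        }

        // Hand out consecutive code values in increasing symbol order.
        String[] codes = new String[lengths.length];
        for (int s = 0; s < lengths.length; s++) {
            int len = lengths[s];
            if (len == 0) continue;
            String bits = next[len].toString(2);
            StringBuilder padded = new StringBuilder();
            for (int i = bits.length(); i < len; i++) padded.append('0');
            codes[s] = padded.append(bits).toString();
            next[len] = next[len].add(BigInteger.ONE);
        }
        return codes;
    }
}
```

Under this assumed convention, the lengths 1, 2, 3, 3 for the symbols a, b, c, d give the codes 0, 10, 110 and 111. The receiver can run the same procedure on the transmitted lengths and arrive at identical codewords without ever seeing the frequencies.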
leaves at depth i.
At this stage, therefore, we are going to send, as part of the header, two sets of data. First, the frequency
distribution of the codeword lengths, that is, the number of codewords with a given codeword length. There
are going to be 256 bytes for this set (with one byte for each value). The next set is the encoding of
the 256 codeword lengths of the 256 symbols using the canonical optimal tree obtained from the frequency
distribution of the first set. These two sets complete the header portion.
Example: If A[0] = 0, A[1] = 2 = 10_2, A[2] = 7 = 111_2, A[3] = 3 = 11_2, A[4] = 4 = 100_2, A[5] = 240 =
11110000_2, and A[i] = 0 for 6 ≤ i < 256, then the output of the encoder is
011001011001110001111101110000
3 Specification
In order to help organize the coding involved in this project, we now lay down the interfaces between the
different parts of the assignment and the classes in which the code is supposed to be written.
Most of the data in this project is handled as a stream of bits. There is no corresponding abstraction in
the common programming languages. We therefore provide a class called “BitStream” to help support this
abstraction. This class supports the reading and writing of bits and bytes and should serve as the medium
of exchange of data across interfaces.
The BitStream class has the following specifications -
BitStream(String inputFileName, boolean type) - initializes the bit stream from a file. The type
field determines the type of the file. If it is true, the file is assumed to be a byte file and is read the usual
way. If the type field is false, the file is assumed to be a bit file written earlier using the toBitFile method.
boolean readBit() - reads one bit from the stream. The return value is the truth value of the next
bit in the stream. It throws java.io.EOFException if the stream is empty or if there are problems associated
with it. You are advised to check the stream using isEmpty() before each read to recognize EOF instead of
relying on this exception.
byte readByte() - reads one byte from the stream. This method, too, throws java.io.EOFException on
an empty stream or if there are problems with the stream.
int writeBit(boolean bit) - writes the specified bit to the stream. It returns non-zero on success and 0
on failure.
int writeByte(byte b) - writes the specified byte to the stream and returns the success status as above.
int toBitFile(String outputFileName) - writes the bits in the stream to a file. The file is specially
marked to handle the conversion from bits to bytes. It returns 0 if the operation fails.
int toByteFile(String outputFileName) - writes this stream to a file as if it were a byte stream. Any
extra bits that do not form a complete byte will be discarded. It returns 0 if the operation fails.
byte [] toBytes() - converts the stream to an array of bytes. Any extra bits that do not form a complete
byte will be discarded.
There should be one main class called “CanonicalTree”. This class encapsulates the canonical optimal
tree described in Section 2.2. Note that while in reality, the sender and the receiver would be using two
different copies of the CanonicalTree class and agreeing on the specification of the tree, in this project, we
have only one class for both these abstractions. However, your code for the decoder (resp. encoder)
should work correctly with any other encoder (resp. decoder) that meets the specifications.
This class must have two constructors to specify how a tree can be built -
CanonicalTree(int [] frequencies) ;
which builds a canonical optimal tree from the frequencies of the 256 symbols. A 0 frequency means that
the symbol is not present.
This constructor is used by the sender to construct a canonical optimal tree from the input data and
by the receiver to construct the tree for the second-level coding of the codeword lengths. The canonical
optimal tree is, however, defined in terms of the codeword lengths alone. Hence, in this constructor, you will
first have to obtain the codeword lengths from the frequencies by executing a simplified version of the Huffman
algorithm, and then use the code of the previous constructor to build the tree. (You could always execute the
full Huffman algorithm and read the codeword lengths off the tree thus built, but this is not necessary; simpler
data structures suffice.) This simplified version of the Huffman
algorithm must be the same with respect to non-deterministic choices for both the sender and the receiver.
This means that ties must be broken in exactly the same manner. We therefore require that the following
choices be made in the implementation of the Huffman-like algorithm:
1. Each node is associated with a symbol - an internal node of the tree is associated with the smaller of
the symbols of its children and a leaf node is associated with the symbol it represents.
2. Whenever there is a tie among nodes with equal frequencies, the algorithm selects the nodes in the lexi-
cographic order of their corresponding symbols - that is, in the order of their ASCII values. Specifically,
it chooses the two lexicographically smallest values.
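The two tie-breaking rules above can be sketched as follows. This is only one possible reading of them, not the reference implementation: the class name HuffmanLengthsSketch, the helper class Group and the method name are hypothetical. Instead of building a tree, each pending node keeps the list of leaf symbols beneath it, and every merge adds one to the length of each of those symbols. Giving a lone symbol the length 1 is an assumption of this sketch; the rules above do not cover that case.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class HuffmanLengthsSketch {
    // One pending node: total weight, associated symbol (rule 1),
    // and the leaf symbols that sit below it.
    private static final class Group {
        final long weight;
        final int symbol;
        final List<Integer> leaves;

        Group(long weight, int symbol, List<Integer> leaves) {
            this.weight = weight;
            this.symbol = symbol;
            this.leaves = leaves;
        }
    }

    // Returns the codeword length of every symbol (0 = symbol absent).
    static int[] codewordLengths(int[] frequencies) {
        int[] lengths = new int[frequencies.length];
        // Order by weight; break ties by the associated symbol (rule 2).
        PriorityQueue<Group> queue = new PriorityQueue<>(
                Comparator.<Group>comparingLong(g -> g.weight)
                          .thenComparingInt(g -> g.symbol));
        for (int s = 0; s < frequencies.length; s++) {
            if (frequencies[s] > 0) {
                List<Integer> leaves = new ArrayList<>();
                leaves.add(s);
                queue.add(new Group(frequencies[s], s, leaves));
            }
        }
        if (queue.size() == 1) {
            // ASSUMPTION: a lone symbol still gets a 1-bit codeword.
            lengths[queue.peek().symbol] = 1;
            return lengths;
        }
        while (queue.size() > 1) {
            Group a = queue.poll();
            Group b = queue.poll();
            List<Integer> merged = new ArrayList<>(a.leaves);
            merged.addAll(b.leaves);
            // Every leaf under the new node moves one level deeper.
            for (int leaf : merged) lengths[leaf]++;
            // Rule 1: the new node takes the smaller of its children's symbols.
            queue.add(new Group(a.weight + b.weight,
                                Math.min(a.symbol, b.symbol), merged));
        }
        return lengths;
    }
}
```

For the frequencies 1, 1, 2, 2 on symbols 0 to 3, the first merge produces a node of weight 2 associated with symbol 0. That node then ties with the leaves for symbols 2 and 3, and the rule selects the nodes for symbols 0 and 2, so the resulting lengths are 3, 3, 2, 1. Because no two pending nodes ever share a symbol, the ordering is total and the outcome is the same on the sender's and the receiver's side.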
This class also should contain four other methods -
byte [] codewordLengths() ;
returns an array of 256 bytes, with byte i being the length of the codeword of the ASCII character with value
i. The length should be 0 if the corresponding symbol does not occur in the input data.
byte [] codewordLengthFrequencies() ;
returns the frequency distribution of the codeword lengths as a 256 byte array. Byte i is the number of
codewords with codeword length i, with codeword length 0 interpreted as above.
int specialEncode(byte [] input, BitStream encoded) ; which takes an array of 256 bytes through
the array “input” and outputs the encoded data (where the encoding is performed according to the pseu-
docode in Section 2.4) to the BitStream “encoded”. The return code follows the usual convention.
byte [] specialDecode(BitStream encoded) ; which outputs the 256 byte array formed by decoding
the bitstream given. This method should terminate when it recovers 256 bytes. No other indication of
termination will be found.
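As an illustration of what codewordLengthFrequencies computes, the following sketch tallies how many symbols have each codeword length. The class name LengthHistogramSketch and the method name are hypothetical, and the sketch returns an int array rather than the byte array required by the specification, so that the counts stay readable. Java bytes are signed, so a length or count above 127 stored in a byte has to be read back with a mask of 0xFF.

```java
class LengthHistogramSketch {
    // codewordLengths[s] is the codeword length of symbol s (0 = symbol absent).
    // Returns counts[i] = number of symbols whose codeword length is i.
    static int[] lengthFrequencies(byte[] codewordLengths) {
        int[] counts = new int[256];
        for (byte length : codewordLengths) {
            // Mask to read the signed Java byte as an unsigned value 0..255.
            counts[length & 0xFF]++;
        }
        return counts;
    }
}
```

With 256 symbols of which only four are present, with lengths 1, 2, 3 and 3, the count for length 0 is 252, the counts for lengths 1 and 2 are 1 each, and the count for length 3 is 2. Narrowing each count with a cast to byte gives the array form described above.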
The code that ties together all these individual bits of code is encapsulated in the classes “Sender”
and “Receiver”, which represent the sending and the receiving process respectively. These classes contain
functions send and receive respectively, whose code is provided by us. This code uses the interfaces specified
so far and it is necessary for you to stick to the specifications closely to get your final code to run. These
two classes also have main methods to take parameters from the command line. The encoding process is all
put together in the send function and the decoding process in the receive function. The transmission of the
message is simulated by writing the encoded message to a file through the BitStream class and then reading
the file in the Receiver class using the BitStream class.
The code for the BitStream, Sender and Receiver classes will be put up on the web page.
5 Deliverables
You are required to turn in the source code files for CanonicalTree.java, SpecialEncoding.java and any other
auxiliary source code files to “project 1”.