UNIT – VIII
SORTING
1. Let A be a list of n elements A1, A2, . . . , An in memory.
2. Sorting A refers to the operation of rearranging the contents of A so that they are in
increasing order (numerically or lexicographically), that is, so that
3. A1 ≤ A2 ≤ A3 ≤ … ≤ An
4. Since A has n elements, there are n! ways that the contents can appear in A.
5. These ways correspond precisely to the n! permutations of 1, 2, . . . , n. Accordingly, each sorting algorithm must
take care of these n! possibilities.
EXAMPLE
1. Suppose an array DATA contains 8 elements as follows:
2. DATA: 77, 33, 44, 11, 88, 22, 66, 55
3. After sorting, DATA must appear in memory as follows:
4. DATA: 11, 22, 33, 44, 55, 66, 77, 88
5. Since DATA consists of 8 elements, there are 8! = 40,320 ways that the
numbers 11, 22, . . . , 88 can appear in DATA.
3. Normally, the complexity function measures only the number of comparisons, since
the number of other operations is at most a constant factor of the number of
comparisons.
4. There are two main cases whose complexity we will consider: the worst case and the
average case. In studying the average case, we make the probabilistic assumption that
all the n! permutations of the given n items are equally likely. (The reader is referred
to Sec. 2.5 for a more detailed discussion of complexity.)
5. Previously, we have studied the bubble sort (Sec. 4.6), quick sort (Sec. 6.5) and heap
sort (Sec. 7.10). The approximate number of comparisons and the order of complexity
of these algorithms are summarized in the following table:

Algorithm      Worst Case      Average Case
Bubble sort    O(n²)           O(n²)
Quick sort     O(n²)           O(n log n)
Heap sort      O(n log n)      O(n log n)
6. Note first that the bubble sort is a very slow way of sorting; its main
advantage is the simplicity of the algorithm.
7. Observe that the average-case complexity O(n log n) of heap sort is the same
as that of quick sort, but its worst-case complexity O(n log n) is better than
the worst-case complexity O(n²) of quick sort.
8. However, empirical evidence seems to indicate that quick sort is superior to
heap sort except on rare occasions.
Lower Bounds
1. The reader may ask whether there is an algorithm which can sort n items in time
of order less than O(n log n).
2. The answer is no. The reason is indicated below. Suppose S is an algorithm
which sorts n items a1, a2, . . . , an.
3. We assume there is a decision tree T corresponding to the algorithm S such that
T is an extended binary search tree where the external nodes correspond to the
n! ways that n items can appear in memory and where the internal nodes
correspond to the different comparisons that may take place during the
execution of the algorithm S.
4. Then the number of comparisons in the worst case for the algorithm S is equal to
the length of the longest path in the decision tree T or, in other words, the depth
D of the tree, T.
5. Moreover, the average number of comparisons for the algorithm S is equal to the
average external path length E of the tree T.
6. Figure 9-1 shows a decision tree T for sorting n = 3 items. Observe that T has n!
= 3! = 6 external nodes. The values of D and E for the tree follow:
7. D = 3 and E = (2 + 3 + 3 + 3 + 3 + 2)/6 = 16/6 ≈ 2.667
8. Consequently, the corresponding algorithm S requires at most (worst case) D = 3
comparisons and, on the average, E = 2.667 comparisons to sort the n = 3 items.
9. Accordingly, studying the worst-case and average-case complexity of a sorting
algorithm S is reduced to studying the values of D and E in the corresponding
decision tree T.
10. First, however, we recall some facts about extended binary trees (Sec. 7.11).
Suppose T is an extended binary tree with N external nodes, depth D and
external path length E(T). Any such tree cannot have more than 2^D external
nodes, and so
11. 2^D ≥ N or, equivalently, D ≥ log N
12. Furthermore, T will have a minimum external path length E(L) among all such
trees with N nodes when T is a complete tree. In such a case,
13. E(L) = N log N + O(N) ≥ N log N
14. The N log N comes from the fact that there are N paths with length log N or log
N + 1, and the O(N) comes from the fact that there are at most N nodes on the
deepest level.
15. Dividing E(L) by the number N of external paths gives the average external path
length E. Thus, for any extended binary tree T with N external nodes,
D ≥ log N and E ≥ log N
16. Now suppose T is the decision tree corresponding to a sorting algorithm S which
sorts n items. Then T has n! external nodes.
17. Substituting n! for N in the above formulas yields
18. D ≥ log n! ≈ n log n and E ≥ log n! ≈ n log n
19. The approximation log n! ≈ n log n comes from Stirling's formula.
20. Thus n log n is a lower bound for both the worst case and the average case.
21. In other words, O(n log n) is the best possible order for any sorting algorithm
which sorts n items.
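The estimate log n! ≈ n log n can also be seen without the full Stirling formula. The following short derivation (in LaTeX) is a standard argument supplied here for completeness; it is not part of the original text:

\[
\log n! \;=\; \sum_{k=1}^{n} \log k \;\ge\; \sum_{k=\lceil n/2 \rceil}^{n} \log k \;\ge\; \frac{n}{2}\,\log\frac{n}{2},
\]

which already shows that log n! grows like n log n up to lower-order terms.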
SORTING FILES; SORTING POINTERS
EXAMPLE 9.2
1. Suppose the personnel file of a company contains the following data on each of
its employees:
2. Name, Social Security Number, Sex, Monthly Salary
3. Sorting the file with respect to the Name key will yield a different order of the
records than sorting the file with respect to the Social Security Number key.
4. The company may want to sort the file according to the Salary field even though
the field may not uniquely determine the employees.
5. Sorting the file with respect to the Sex key will likely be useless; it simply
separates the employees into two sub files, one with the male employees and
one with the female employees.
6. Sorting a file F by reordering the records in memory may be very expensive when
the records are very long.
7. Moreover, the records may be in secondary memory, where it is even more time-
consuming to move records into different locations.
8. Accordingly, one may prefer to form an auxiliary array POINT containing pointers
to the records in memory and then sort the array POINT with respect to a field
KEY rather than sorting the records themselves.
9. That is, we sort POINT so that
10. KEY[POINT[1]] ≤ KEY[POINT[2]] ≤ . .. ≤ KEY[POINT[N]]
11. Note that choosing a different field KEY will yield a different order of the array
POINT.
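To make the pointer technique concrete, here is a minimal Python sketch (the record layout and the field names name, ssn and salary are illustrative assumptions, not from the text). Only the index array POINT is rearranged; the records themselves never move:

records = [
    {"name": "Davis", "ssn": "192-38-7282", "salary": 22800},
    {"name": "Kelly", "ssn": "165-64-3351", "salary": 19000},
    {"name": "Green", "ssn": "175-56-2251", "salary": 27200},
]

# POINT holds indices into records (0-based here, unlike the 1-based text).
point = list(range(len(records)))

# Sort POINT so that KEY[POINT[0]] <= KEY[POINT[1]] <= ... with KEY = name.
point.sort(key=lambda i: records[i]["name"])
print([records[i]["name"] for i in point])    # ['Davis', 'Green', 'Kelly']

# Choosing a different KEY field yields a different order of POINT.
point_by_salary = sorted(point, key=lambda i: records[i]["salary"])
print([records[i]["salary"] for i in point_by_salary])  # [19000, 22800, 27200]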
EXAMPLE 9.3
1. Figure 9-2 shows a file of 9 employee records together with two arrays of
pointers, PTRNAME and PTRSSN.
2. PTRNAME holds the pointers sorted according to the NAME field, and PTRSSN
holds the pointers sorted according to the SSN field, that is,
3. NAME[PTRNAME[1]] < NAME[PTRNAME[2]] < … < NAME[PTRNAME[9]]
4. SSN[PTRSSN[1]] < SSN[PTRSSN[2]] < … < SSN[PTRSSN[9]]
5. Given the name EMP of an employee, one can easily find the location of EMP's
record in memory by using the array PTRNAME and the binary search algorithm.
6. Similarly, given the social security number NUMB of an employee, one can easily
find the location of the employee's record in memory by using the array PTRSSN
and the binary search algorithm.
7. Observe, also, that it is not even necessary for the records to appear in successive
memory locations. Thus inserting and deleting records can easily be done.
INSERTION SORT
1. Suppose an array A with n elements A[1], A[2],..., A[N] is in memory.
2. The insertion sort algorithm scans A from A[1] to A[N], inserting each element
A[K] into its proper position in the previously sorted sub array A[1],
A[2], . . . , A[K−1]. That is:
3. Pass 1. A[1] by itself is trivially sorted.
4. Pass 2. A[2] is inserted either before or after A[1] so that: A[1], A[2] is sorted.
5. Pass 3. A[3] is inserted into its proper place in A[1], A[2], that is, before A[1],
between A[1] and A[2], or after A[2], so that: A[1], A[2], A[3] is sorted.
6. Pass 4. A[4] is inserted into its proper place in A[1], A[2], A[3] so that:
7. A[1], A[2], A[3], A[4] is sorted.
8. ……………………………………………………………………………………
9. Pass N. A[N] is inserted into its proper place in A[1], A[2], . . . , A[N−1] so that:
A[1], A[2], . . . , A[N] is sorted.
10. This sorting algorithm is frequently used when n is small. For example, this
algorithm is very popular with bridge players when they are first sorting their
cards.
11. There remains only the problem of deciding how to insert A[K] in its proper place
in the sorted sub array A[1], A[2], . . . , A[K−1].
12. This can be accomplished by comparing A[K] with A[K−1], comparing A[K] with
A[K−2], comparing A[K] with A[K−3], and so on, until first meeting an element
A[J] such that A[J] ≤ A[K]. Then each of the elements A[K−1], A[K−2], . . . , A[J+1] is
moved forward one location, and A[K] is then inserted in the (J+1)st position in the
array.
13. The algorithm is simplified if there always is an element A[J] such that A[J] ≤ A[K];
otherwise we must constantly check to see if we are comparing A[K] with A[1].
14. This condition can be accomplished by introducing a sentinel element A[0] = - ∞
(or a very small number).
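The passes above translate directly into code. The following is a minimal Python sketch (Python lists are 0-based, so an explicit bounds test plays the role of the sentinel A[0] = −∞):

def insertion_sort(a):
    """Sort the list a in place, one pass per element as described above."""
    for k in range(1, len(a)):
        item = a[k]                      # element to insert into sorted a[0:k]
        j = k - 1
        # Scan backward until an element <= item is met
        # (the j >= 0 test replaces the sentinel A[0] = -infinity).
        while j >= 0 and a[j] > item:
            a[j + 1] = a[j]              # move element one location forward
            j -= 1
        a[j + 1] = item                  # insert item in its proper place

data = [77, 33, 44, 11, 88, 22, 66, 55]
insertion_sort(data)
print(data)   # [11, 22, 33, 44, 55, 66, 77, 88]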
Remark: Time may be saved by performing a binary search, rather than a linear search, to find
the location in which to insert A[K] in the sub array A[1], A[2], . . . , A[K−1]. This
requires, on the average, log K comparisons rather than (K−1)/2 comparisons.
However, one still needs to move (K−1)/2 elements forward. Thus the order of
complexity is not changed. Furthermore, insertion sort is usually used only when n is
small, and in such a case, the linear search is about as efficient as the binary search.
SELECTION SORT
1. Suppose an array A with n elements A[1], A[2], . . . , A[N] is in memory.
2. The selection sort algorithm for sorting A works as follows.
3. First find the smallest element in the list and put it in the first position.
4. Then find the second smallest element in the list and put it in the second position.
5. And so on. More precisely:
Pass 1. Find the location LOC of the smallest in the list of N elements
A[1], A[2],. . . , A[N], and then interchange A[LOC] and A[1]. Then:
A[1] is sorted.
Pass 2. Find the location LOC of the smallest in the sublist of N − 1 elements
A[2], A[3], . . . , A[N], and then interchange A[LOC] and A[2].
Then: A[1], A[2] is sorted, since A[1] ≤ A[2].
Pass 3. Find the location LOC of the smallest in the sublist of N − 2
elements A[3], A[4], . . . , A[N], and then interchange A[LOC] and
A[3]. Then:
A[1], A[2], A[3] is sorted, since A[2] ≤ A[3].
…………………………………………………….
……………………………………………………
Pass N — 1.Find the location LOC of the smaller of the elements A[N — 1],
A[N], and then interchange A[LOC] and A[N — 1]. Then:
A[1], A[2], . . . , A[N] is sorted, since A[N — 1] ≤ A[N].
Thus A is sorted after N — 1 passes.
Example 9.5
1. Suppose an array A contains 8 elements as follows:
77, 33, 44, 11, 88, 22, 66, 55
2. Applying the selection sort algorithm to A yields the data in Fig. 9-4.
3. Observe that LOC gives the location of the smallest among A[K], A[K +
1],……..A[N] during Pass K.
4. The circled elements indicate the elements which are to be interchanged.
5. There remains only the problem of finding, during the Kth pass, the location LOC
of the smallest among the elements A[K], A[K+1], . . . , A[N].
6. This may be accomplished by using a variable MIN to hold the current smallest
value while scanning the sub array from A[K] to A[N].
7. Specifically, first set MIN := A[K] and LOC := K, and then traverse the list,
comparing MIN with each other element A[j] as follows:
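A minimal Python sketch of that scan, together with the full selection sort, follows (0-based indices rather than the text's 1-based ones):

def find_min(a, k):
    """Location LOC of the smallest among a[k], a[k+1], ..., a[n-1]:
    set MIN := a[k], LOC := k, then update both while scanning."""
    loc, minimum = k, a[k]
    for j in range(k + 1, len(a)):
        if a[j] < minimum:               # new smallest value found
            minimum, loc = a[j], j
    return loc

def selection_sort(a):
    """n - 1 passes; Pass k interchanges a[k] with the minimum of a[k:]."""
    for k in range(len(a) - 1):
        loc = find_min(a, k)
        a[k], a[loc] = a[loc], a[k]      # interchange A[LOC] and A[K]

data = [77, 33, 44, 11, 88, 22, 66, 55]
selection_sort(data)
print(data)   # [11, 22, 33, 44, 55, 66, 77, 88]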
1. First note that the number f(n) of comparisons in the selection sort algorithm is
independent of the original order of the elements. Observe that MIN(A, K, N, LOC)
requires n − K comparisons.
2. That is, there are n − 1 comparisons during Pass 1 to find the smallest element,
there are n − 2 comparisons during Pass 2 to find the second smallest element,
and so on. Accordingly,
3. f(n) = (n − 1) + (n − 2) + … + 2 + 1 = n(n − 1)/2 = O(n²)
4. The above result is summarized in the following table:

Algorithm        Worst Case           Average Case
Selection sort   n(n−1)/2 = O(n²)     n(n−1)/2 = O(n²)
Remark: The number of interchanges and assignments does depend on the original
order of the elements in the array A, but the sum of these operations does not exceed a
factor of n².
MERGING
1. Suppose A is a sorted list with r elements and B is a sorted list with s elements.
2. The operation that combines the elements of A and B into a single sorted list C
with n = r + s elements is called merging.
3. One simple way to merge is to place the elements of B after the elements of A
and then use some sorting algorithm on the entire list.
4. This method does not take advantage of the fact that A and B are individually
sorted. A much more efficient algorithm is Algorithm 9.4 in this section.
5. First, however, we indicate the general idea of the algorithm by means of two
examples.
6. Suppose one is given two sorted decks of cards.
7. The decks are merged as in Fig. 9-5.
8. That is, at each step, the two front cards are compared and the smaller one is
placed in the combined deck.
9. When one of the decks is empty, all of the remaining cards in the other deck are
put at the end of the combined deck.
10. Similarly, suppose we have two lines of students sorted by increasing heights,
and suppose we want to merge them into a single sorted line.
11. The new line is formed by choosing, at each step, the shorter of the two students
who are at the head of their respective lines.
12. When one of the lines has no more students, the remaining students line up at
the end of the combined line.
13. The above discussion will now be translated into a formal algorithm which
merges a sorted r – element array A and a sorted s – element array B into a
sorted array C, with n = r + s elements.
14. First of all, we must always keep track of the locations of the smallest element of
A and the smallest element of B which have not yet been placed in C.
15. Let NA and NB denote these locations, respectively. Also, let PTR denote the
location in C to be filled. Thus, initially, we set NA := 1, NB := 1 and PTR := 1. At
each step of the algorithm, we compare
16. A[NA] and B[NB]
17. and assign the smaller element to C[PTR]. Then we increment PTR by setting
PTR := PTR + 1, and we either increment NA by setting NA := NA + 1 or increment
NB by setting NB := NB + 1, according to whether the new element in C has come
from A or from B.
18. Furthermore, if NA> r, then the remaining elements of B are assigned to C; or if
NB > s, then the remaining elements of A are assigned to C.
19. The formal statement of the algorithm follows.
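Algorithm 9.4 itself is not reproduced in these notes, so the following minimal Python sketch of the NA/NB/PTR scheme just described stands in for it (0-based indices, with PTR implicit in the appends):

def merge(a, b):
    """Merge sorted lists a (r elements) and b (s elements) into a new
    sorted list c with n = r + s elements."""
    c = []
    na, nb = 0, 0                        # 0-based counterparts of NA and NB
    while na < len(a) and nb < len(b):
        if a[na] <= b[nb]:               # compare A[NA] and B[NB]
            c.append(a[na]); na += 1
        else:
            c.append(b[nb]); nb += 1
    c.extend(a[na:])                     # B exhausted: copy the rest of A
    c.extend(b[nb:])                     # A exhausted: copy the rest of B
    return c

print(merge([11, 33, 77], [22, 44, 55, 88]))
# [11, 22, 33, 44, 55, 77, 88]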
MERGE-SORT
Suppose an array A with n elements A[1], A[2], . . . , A[N] is in memory. The
merge-sort algorithm which sorts A will first be described by means of a specific
example.
EXAMPLE 9.7
5. The above merge-sort algorithm for sorting an array A has the following
important property.
6. After Pass K, the array A will be partitioned into sorted sub arrays where each
sub array, except possibly the last, will contain exactly L = 2^K elements.
7. Hence the algorithm requires at most log n passes to sort an n-element array A.
8. The above informal description of merge-sort will now be translated into a formal
algorithm which will be divided into two parts.
9. The first part will be a procedure MERGEPASS, which uses Procedure 9.5 to
execute a single pass of the algorithm; and the second part will repeatedly apply
MERGEPASS until A is sorted.
10. The MERGEPASS procedure applies to an n-element array A which consists of a
sequence of sorted sub arrays.
11. Moreover, each sub array consists of L elements except that the last sub array
may have fewer than L elements.
12. Dividing n by 2 * L, we obtain the quotient Q, which tells the number of pairs of
L-element sorted sub arrays; that is,
Q = INT(N/(2*L))
13. (We use INT(X) to denote the integer value of X.) Setting S = 2*L*Q, we get the total
number S of elements in the Q pairs of sub arrays.
14. Hence R = N — S denotes the number of remaining elements.
15. The procedure first merges the initial Q pairs of L-element sub arrays.
16. Then the procedure takes care of the case where there is an odd number of sub
arrays (when R ≤ L) or where the last sub array has fewer than L elements.
17. The formal statement of MERGEPASS and the merge-sort algorithm follow:
Procedure 9.6: MERGEPASS(A, N, L, B)
The N-element array A is composed of sorted sub arrays where each
sub array has L elements except possibly the last sub array, which may
have fewer than L elements. The procedure merges the pairs of sub
arrays of A and assigns them to the array B.
1. Set Q := INT(N/(2*L)), S := 2*L*Q and R := N − S.
2. [Use Procedure 9.5 to merge the Q pairs of sub arrays.]
   Repeat for J = 1, 2, . . . , Q:
   (a) Set LB := 1 + (2*J − 2)*L. [Finds lower bound of first array.]
   (b) Call MERGE(A, L, LB, A, L, LB + L, B, LB).
   [End of loop.]
3. [Only one sub array left?]
   If R ≤ L, then:
       Repeat for J = 1, 2, . . . , R:
           Set B[S + J] := A[S + J].
       [End of loop.]
   Else:
       Call MERGE(A, L, S + 1, A, R, L + S + 1, B, S + 1).
   [End of If structure.]
4. Return.
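As a companion to Procedure 9.6, here is a compact Python sketch of one merge pass and the driver loop. It is a simplification, not the procedure itself: it returns a new list on each pass instead of alternating between the arrays A and B, and it uses the standard-library heapq.merge for the two-way merges:

from heapq import merge                  # standard-library two-way merge

def merge_pass(a, l):
    """One MERGEPASS: merge adjacent sorted runs of length L into runs of
    length 2L; slicing handles the short or unpaired final run."""
    b = []
    for lb in range(0, len(a), 2 * l):   # LB = lower bound of each pair
        b.extend(merge(a[lb:lb + l], a[lb + l:lb + 2 * l]))
    return b

def merge_sort(a):
    """Repeat merge_pass with L = 1, 2, 4, ...; at most log2 n passes."""
    l = 1
    while l < len(a):
        a = merge_pass(a, l)
        l = 2 * l
    return a

print(merge_sort([77, 33, 44, 11, 88, 22, 66, 55]))
# [11, 22, 33, 44, 55, 66, 77, 88]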
Since we want the sorted array to finally appear in the original array A, we must execute
the procedure MERGEPASS an even number of times.
Let f(n) denote the number of comparisons needed to sort an n-element array A using
the merge-sort algorithm. Recall that the algorithm requires at most log n passes. Moreover, each pass merges
a total of n elements, and by the discussion on the complexity of merging, each pass will require at most n
comparisons. Accordingly, for both the worst case and average case,
f(n) ≤ n log n
Observe that this algorithm has the same order as heap sort and the same average order as quick sort. The
main drawback of merge-sort is that it requires an auxiliary array with n elements. Each of the other sorting
algorithms we have studied requires only a finite number of extra locations, which is independent of n.
The above results are summarized in the following table:

Algorithm     Worst Case              Average Case
Merge-sort    n log n = O(n log n)    n log n = O(n log n)
RADIX SORT
1. Radix sort is the method that many people intuitively use or begin to use when
alphabetizing a large list of names.
2. (Here the radix is 26, the 26 letters of the alphabet.)
3. Specifically, the list of names is first sorted according to the first letter of each
name.
4. That is, the names are arranged in 26 classes, where the first class consists of
those names that begin with “A,” the second class consists of those names that
begin with “B,” and so on.
5. During the second pass, each class is alphabetized according to the second letter
of the name. And so on. If no name contains, for example, more than 12 letters,
the names are alphabetized with at most 12 passes.
6. The radix sort is the method used by a card sorter. A card sorter contains 13
receiving pockets labeled as follows:
7. 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 11, 12, R (reject)
8. Each pocket other than R corresponds to a row on a card in which a hole can be
punched. Decimal numbers, where the radix is 10, are punched in the obvious
way and hence use only the first 10 pockets of the sorter.
9. The sorter uses a radix reverse-digit sort on numbers.
10. That is, suppose a card sorter is given a collection of cards where each card
contains a 3-digit number punched in columns 1 to 3.
11. The cards are first sorted according to the units digit. On the second pass, the
cards are sorted according to the tens digit.
12. On the third and last pass, the cards are sorted according to the hundreds digit.
We illustrate with an example.
EXAMPLE 9.8
Suppose 9 cards are punched as follows:
348, 143, 361, 423, 538, 128, 321, 543, 366
Given to a card sorter, the numbers would be sorted in three phases, as pictured in Fig.9-6:
(a) In the first pass, the units digits are sorted into pockets. (The pockets are pictured upside
down, so 348 is at the bottom of pocket 8.) The cards are collected pocket by pocket, from
pocket 9 to pocket 0. (Note that 361 will now be at the bottom of the pile and 128 at the
top of the pile.) The cards are now reinput to the sorter.
(b) In the second pass, the tens digits are sorted into pockets. Again the cards are collected
pocket by pocket and reinput to the sorter.
(c) In the third and final pass, the hundreds digits are sorted into pockets.
When the cards are collected after the third pass, the numbers are in the following
order:
128, 143, 321, 348, 361, 366, 423, 538, 543
Thus the cards are now sorted.
The number C of comparisons needed to sort nine such 3-digit numbers is bounded as
follows:
C≤9*3*10
The 9 comes from the nine cards, the 3 comes from the three digits in each number,
and the 10 comes from radix d = 10 digits.
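The card-sorter procedure is easy to simulate. A minimal Python sketch of the least-significant-digit radix sort follows (pockets are collected in ascending order 0 to 9, which matches the net effect described in Example 9.8):

def radix_sort(nums, digits=3, radix=10):
    """Sort by units, then tens, then hundreds, ... as a card sorter does."""
    place = 1
    for _ in range(digits):
        pockets = [[] for _ in range(radix)]
        for x in nums:
            pockets[(x // place) % radix].append(x)       # drop into its pocket
        nums = [x for pocket in pockets for x in pocket]  # collect the pockets
        place *= radix
    return nums

cards = [348, 143, 361, 423, 538, 128, 321, 543, 366]
print(radix_sort(cards))
# [128, 143, 321, 348, 361, 366, 423, 538, 543]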
SEARCHING FILES; SEARCHING POINTERS
a) Suppose a file F of records R1, R2, . . . , RN is stored in memory.
b) Searching F usually refers to finding the location LOC in memory of the record with a
given key value relative to a primary key field K.
c) One way to simplify the searching is to use an auxiliary sorted array of pointers, as
discussed in Sec. 9.2.
d) Then a binary search can be used to quickly find the location LOC of the record with
the given key.
e) In the case where there is a great deal of inserting and deleting of records in the file,
one might want to use an auxiliary binary search tree rather than an auxiliary
sorted array.
f) In any case, the searching of the file F is reduced to the searching of a collection S of
items, as discussed above.
HASHING
1. The search time of each algorithm discussed so far depends on the number n of
elements in the collection S of data.
2. This section discusses a searching technique, called hashing or hash addressing,
which is essentially independent of the number n.
3. The terminology which we use in our presentation of hashing will be oriented
toward file management.
4. First of all, we assume that there is a file F of n records with a set K of keys
which uniquely determine the records in F.
5. Secondly, we assume that F is maintained in memory by a table T of m memory
locations and that L is the set of memory addresses of the locations in T.
6. For notational convenience, we assume that the keys in K and the addresses in L
are (decimal) integers.
7. (Analogous methods will work with binary integers or with keys which are
character strings, such as names, since there are standard ways of representing
strings by integers.)
8. The subject of hashing will be introduced by the following example.
EXAMPLE
1. Suppose a company with 68 employees assigns a 4-digit employee number to
each employee which is used as the primary key in the company’s employee file.
2. We can, in fact, use the employee number as the address of the record in
memory. The search will require no comparisons at all.
3. Unfortunately, this technique will require space for 10,000 memory locations,
whereas space for fewer than 100 such locations would actually be used.
4. Clearly, this tradeoff of space for time is not worth the expense.
5. The general idea of using the key to determine the address of a record is an
excellent idea, but it must be modified so that a great deal of space is not
wasted.
6. This modification takes the form of a function H from the set K of keys into the
set L of memory addresses. Such a function,
7. H:K→L
8. is called a hash function or hashing function. Unfortunately, such a function H
may not yield distinct values: it is possible that two different keys k1 and k2 will
yield the same hash address.
9. This situation is called collision, and some method must be used to resolve it.
10. Accordingly, the topic of hashing is divided into two parts:
11. (1) hash functions and
12. (2) collision resolution. We discuss these two parts separately.
Hash Functions
1. The two principal criteria used in selecting a hash function H: K→ L are as follows.
2. First of all, the function H should be very easy and quick to compute.
3. Second the function H should, as far as possible, uniformly distribute the hash
addresses throughout the set L so that there are a minimum number of
collisions.
4. Naturally, there is no guarantee that the second condition can be completely
fulfilled without actually knowing beforehand the keys and addresses.
5. However, certain general techniques do help.
6. One technique is to “chop” a key k into pieces and combine the pieces in some
way to form the hash address H(k). (The term “hashing” comes from this
technique of “chopping” a key into pieces.)
7. We next illustrate some popular hash functions. We emphasize that each of these
hash functions can be easily and quickly evaluated by the computer.
8. (a) Division method. Choose a number m larger than the number n of keys
in K. (The number m is usually chosen to be a prime number or a number
without small divisors, since this frequently minimizes the number of collisions.)
The hash function H is defined by
H(k)=k (mod m) or H(k)=k (mod m) + 1
Here k (mod m) denotes the remainder when k is divided by m. The second
formula is used when we want the hash addresses to range from 1 to m rather
than from 0 to m — 1.
(b) Mid-square method. The key k is squared. Then the hash function H is
defined by H(k) = l,
where l is obtained by deleting digits from both ends of k². We emphasize that
the same positions of k² must be used for all of the keys.
(c) Folding method. The key k is partitioned into a number of parts, k1, k2, . . . , kr,
where each part, except possibly the last, has the same number of digits as
the required address. Then the parts are added together, ignoring the last
carry. That is, H(k) = k1 + k2 + … + kr
where the leading-digit carries, if any, are ignored. Sometimes, for extra
“milling,” the even-numbered parts, k2, k4,. . . , are each reversed before the
addition.
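All three methods are easily coded. A minimal Python sketch follows; the choice of m = 97, of which two digits of k² count as the "middle," and of 2-digit parts for folding are illustrative assumptions for 4-digit keys:

def hash_division(k, m=97):
    """Division method: H(k) = k mod m, with m prime."""
    return k % m

def hash_midsquare(k):
    """Mid-square method: take two middle digits of k**2; the same
    positions must be used for every key, hence the fixed-width pad."""
    s = str(k * k).zfill(8)
    mid = len(s) // 2
    return int(s[mid - 1:mid + 1])

def hash_folding(k, reverse_even=False):
    """Folding method: split a 4-digit key into 2-digit parts and add,
    ignoring any carry out of two digits; optionally reverse part k2."""
    parts = [k // 100, k % 100]
    if reverse_even:                     # extra "milling"
        parts[1] = int(str(parts[1]).zfill(2)[::-1])
    return sum(parts) % 100              # ignore the leading-digit carry

k = 3205
print(hash_division(k), hash_midsquare(k), hash_folding(k))   # 4 72 37
print(hash_folding(k, reverse_even=True))                     # 82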
Collision Resolution
1. Suppose we want to add a new record R with key k to our file F, but suppose the
memory location address H(k) is already occupied. This situation is called
collision.
2. This subsection discusses two general ways of resolving collisions. The particular
procedure that one chooses depends on many factors.
3. One important factor is the ratio of the number n of keys in K (which is the
number of records in F) to the number m of hash addresses in L. This ratio, A = n
/ m, is called the load factor.
4. First we show that collisions are almost impossible to avoid. Specifically, suppose
a student class has 24 students and suppose the table has space for 365
records.
5. One random hash function is to choose the student's birthday as the hash
address. Although the load factor A = 24/365 ≈ 7% is very small, it can be shown
that there is a better than fifty-fifty chance that two of the students have the
same birthday. (A short computation verifying this appears after this list.)
6. The efficiency of a hash function with a collision resolution procedure is
measured by the average number of probes (key comparisons) needed to find the
location of the record with a given key k.
7. The efficiency depends mainly on the load factor A. Specifically, we are
interested in the following two quantities:
S(A) = average number of probes for a successful search
U(A) = average number of probes for an unsuccessful search
These quantities will be discussed for our collision procedures.
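The fifty-fifty claim made for the birthday example above can be verified directly; this is standard birthday-problem arithmetic, sketched here in Python:

# P(all 24 birthdays distinct) = (365/365)(364/365)...(342/365)
p_distinct = 1.0
for i in range(24):
    p_distinct *= (365 - i) / 365
print(1 - p_distinct)   # about 0.54: better than a fifty-fifty chance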
Open Addressing: Linear Probing
Suppose a new record R with key k is to be added to the table T, but the memory
location with hash address H(k) = h is already occupied. One natural way to resolve
the collision is to assign R to the first available location following T[h], searching
T[h+1], T[h+2], . . . in a circular fashion. This collision resolution procedure is called
linear probing, and a search for a record proceeds in the same way.
EXAMPLE 9.11
1. Suppose the table T has 11 memory locations, T[1], T[2], . . . , T[11], and suppose
the file F consists of 8 records, A, B, C, D, E, X, Y and Z, with the following hash
addresses:
2. Record: A, B, C, D, E, X, Y, Z
3. H(k):   4, 8, 2, 11, 4, 11, 5, 1
4. Suppose the 8 records are entered into the table T in the above order. Then the
file F will appear in memory as follows:
5. Table T: X, C, Z, A, E, Y,__, B,__,__, D
6. Address: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
7. Although Y is the only record with hash address H(k) = 5, the record is not
assigned to T[5], since T[5] has already been filled by E because of a previous
collision at T[4]. Similarly, Z does not appear in T[1].
8. The average number S of probes for a successful search follows:
9. S = (1 + 1 + 1 + 1 + 2 + 2 + 2 + 3)/8 = 13/8 ≈ 1.6
10. The average number U of probes for an unsuccessful search follows:
U = (7 + 6 + 5 + 4 + 3 + 2 + 1 + 2 + 1 + 1 + 8)/11 = 40/11 ≈ 3.6
11. The first sum adds the number of probes to find each of the 8 records, and the
second sum adds the number of probes to find an empty location for each of the 11
locations.
12. One main disadvantage of linear probing is that records tend to cluster, that is,
appear next to one another, when the load factor is greater than 50 percent.
13. Such a clustering substantially increases the average search time for a record. Two
techniques to minimize clustering are as follows:
(1) Quadratic probing. Suppose a record R with key k has the hash address H(k) =
h. Then, instead of searching the locations with addresses h, h + 1, h + 2, . . . ,
we search the locations with addresses
h, h + 1, h + 4, h + 9, h + 16, . . . , h + i², . . .
If the number m of locations in the table T is a prime number, then the above
sequence will access half of the locations in T.
(2) Double hashing. Here a second hash function H’ is used for resolving a
collision, as follows. Suppose a record R with key k has the hash addresses H(k)
= h and H’(k) = h’≠ m. Then we linearly search the locations with addresses
h, h+h’, h+2h’, h+3h’,……
If m is a prime number, then the above sequence will access all the locations in
the table T.
13. Remark: One major disadvantage in any type of open addressing procedure is in
the implementation of deletion. Specifically, suppose a record R is deleted from
the location T[r].
14. Afterwards, suppose we meet T[r] while searching for another record R’. This
does not necessarily mean that the search is unsuccessful.
15. Thus, when deleting the record R, we must label the location T[r] to indicate that
it previously did contain a record.
16. Accordingly, open addressing may seldom be used when a file F is constantly
changing.
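A minimal Python sketch of insertion with linear probing follows; it reproduces the table of Example 9.11. (Double hashing would differ only in using a step of h′ instead of 1; the sketch assumes the table never fills completely.)

def insert_linear(table, key, h):
    """Place key at 1-based hash address h, probing circularly past
    occupied cells; assumes at least one free cell remains."""
    m = len(table)
    i = h - 1                            # convert to 0-based
    while table[i] is not None:          # collision: try next location
        i = (i + 1) % m
    table[i] = key

table = [None] * 11
for rec, h in [("A", 4), ("B", 8), ("C", 2), ("D", 11),
               ("E", 4), ("X", 11), ("Y", 5), ("Z", 1)]:
    insert_linear(table, rec, h)
print(table)
# ['X', 'C', 'Z', 'A', 'E', 'Y', None, 'B', None, None, 'D']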
Chaining
1. Chaining involves maintaining two tables in memory.
2. First of all, as before, there is a table T in memory which contains the records in
F, except that T now has an additional field LINK which is used so that all records
in T with the same hash address h may be linked together to form a linked list.
3. Second, there is a hash address table LIST which contains pointers to the linked
lists in T.
4. Suppose a new record R with key k is added to the file F. We place R in the first
available location in the table T and then add R to the linked list with pointer
LIST[H(k)].
5. If the linked lists of records are not sorted, then R is simply inserted at the
beginning of its linked list.
6. Searching for a record or deleting a record is nothing more than searching for a
node or deleting a node from a linked list, as discussed in Chap. 5.
7. The average numbers of probes, using chaining, for a successful search and for
an unsuccessful search are known to have the following approximate values:
S(A) ≈ 1 + A/2 and U(A) ≈ e^(−A) + A
8. Here the load factor A = n/m may be greater than 1, since the number m of hash
addresses in L (not the number of locations in T) may be less than the number n of
records in F.
EXAMPLE 9.12
1. Consider again the data in Example 9.11, where the 8 records have the following
hash addresses:
2. Record: A, B, C, D, E, X, Y, Z
3. H(k): 4, 8, 2, 11, 4, 11, 5, 1
4. Using chaining, the records will appear in memory as pictured in Fig. 9-7.
Observe that the location of a record R in table T is not related to its hash
address.
5. A record is simply put in the first node in the AVAIL list of table T. In fact, table T
need not have the same number of elements as the hash address table.
6. The main disadvantage to chaining is that one needs 3m memory cells for the
data. Specifically, there are m cells for the information field INFO, there are m
cells for the link field LINK, and there are m cells for the pointer array LIST.
7. Suppose each record requires only 1 word for its information field. Then it may
be more useful to use open addressing with a table with 3m locations, which has
the load factor A ≤ 1/3, than to use chaining to resolve collisions.
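A minimal Python sketch of chaining follows; Python lists stand in for the INFO/LINK fields and the AVAIL list, which is an implementation shortcut rather than the storage scheme of Fig. 9-7:

class ChainedTable:
    """LIST[h] heads the chain of all records whose key hashes to h."""
    def __init__(self, m):
        self.lists = [[] for _ in range(m)]      # hash address table LIST

    def insert(self, key, h):
        """Unsorted chains: new records go at the beginning (1-based h)."""
        self.lists[h - 1].insert(0, key)

    def search(self, key, h):
        """Walk the chain for hash address h."""
        return key in self.lists[h - 1]

t = ChainedTable(11)
for rec, h in [("A", 4), ("B", 8), ("C", 2), ("D", 11),
               ("E", 4), ("X", 11), ("Y", 5), ("Z", 1)]:
    t.insert(rec, h)
print(t.lists[3])          # chain for hash address 4: ['E', 'A']
print(t.search("X", 11))   # True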
Supplementary Problems
SORTING
9.1 Write a subprogram RANDOM(DATA, N, K) which assigns N random integers between 1 and K to the array DATA.
9.2 Translate insertion sort into a subprogram INSERTSORT(A, N) which sorts the array A with N elements. Test the program using:
(a) 44, 33, 11, 55, 77, 90, 40, 60, 99, 22, 88, 66
(b) D, A, T, A, S, T, R, U, C, T, U, R, E, S
9.3 Translate insertion sort into a subprogram INSERTCOUNT(A, N, NUMB) which sorts the array A with N elements and which also counts the number NUMB of comparisons.
9.4 Write a program TESTINSERT(N, AVE) which repeats 500 times the procedure INSERTCOUNT(A, N, NUMB) and which finds the average AVE of the 500 values of NUMB. (Theoretically, AVE ≈ N²/4.) Use RANDOM(A, N, 5*N) from Prob. 9.1 as each input. Test the program using N = 100 (so, theoretically, AVE ≈ N²/4 = 2500).
9.5 Translate quick sort into a subprogram QUICKCOUNT(A, N, NUMB) which sorts the array A with N elements and which also counts the number NUMB of comparisons. (See Sec. 6.5.)
9.6 Write a program TESTQUICKSORT(N, AVE) which repeats QUICKCOUNT(A, N, NUMB) 500 times and which finds the average AVE of the 500 values of NUMB. (Theoretically, AVE ≈ N log₂ N.) Use RANDOM(A, N, 5*N) from Prob. 9.1 as each input. Test the program using N = 100 (so, theoretically, AVE ≈ 700).
9.7 Translate Procedure 9.2 into a subprogram MIN(A, LB, UB, LOC) which finds the location LOC of the smallest element among A[LB], A[LB + 1], . . . , A[UB].
9.8 Translate selection sort into a subprogram SELECTSORT(A, N) which sorts the array with N elements. Test the program using:
(a) 44, 33, 11, 55, 77, 90, 40, 60, 99, 22, 88, 66
(b) D, A, T, A, S, T, R, U, C, T, U, R, E, S
SEARCHING, HASHING
9.9 Suppose an unsorted linked list is in memory. Write a procedure which (a) finds the location LOC of ITEM in the list or sets LOC := NULL for an unsuccessful search and (b) when the search is successful, interchanges ITEM with the element in front of it. (Such a list is said to be self-organizing. It has the property that elements which are frequently accessed tend to move to the beginning of the list.)
9.10 Consider the following 4-digit employee numbers (see Example 9.10):
Find the 2-digit hash address of each number using (a) the division method, with m = 97; (b) the mid square method; (c) the folding method without reversing; and (d) the folding method with
reversing.
9.11 Consider the data in Example 9.11. Suppose the 8 records are entered into the table T in the reverse order Z, Y, X, E, D, C, B, A. (a) Show how the file F appears in memory. (b) Find the average number S of probes for a successful search and the average number U of probes for an unsuccessful search. (Compare with the corresponding results in Example 9.11.)
9.12 Consider the data in Example 9.12 and Fig. 9-7. Suppose the following additional records are added to the file: (Here the left entry is the record and the right entry is the hash address.) (a) Find the updated tables T and LIST. (b) Find the average number S of probes for a successful search and the average number U of probes for an unsuccessful search.
9.13 Write a subprogram MID(KEY, HASH) which uses the mid square method to find the 2-digit hash address HASH of a 4-digit employee number key.
9.14 Write a subprogram FOLD( KEY, HASH) which uses the folding method with reversing to find the 2-digit hash address HASH of a 4-digit employee number key.
End of UNIT 8