Unit 8


C & DS

UNIT – VIII

Sorting & Searching: Searching methods: linear and binary search. Sorting methods, e.g. bubble sort,
selection sort, insertion sort, heap sort, quick sort.

Sorting and Searching


Sorting and searching are fundamental operations in computer science.
1. Sorting refers to the operation of arranging data in some given
order, such as increasing or decreasing, with numerical data, or
alphabetically, with character data.
2. Searching refers to the operation of finding the location of a given
item in a collection of items.
There are many sorting and searching algorithms. Some of them, such as heap sort and binary search, have already been discussed throughout the text. The particular algorithm
one chooses depends on the properties of the data and the operations one may perform on the data. Accordingly, we will want to know the complexity of each algorithm; that is,
we will want to know the running time f(n) of each algorithm as a function of the number n of input items. Sometimes, we will also discuss the space requirements of our
algorithms.

3. Sorting and searching frequently apply to a file of records, so we recall some standard terminology. Each record in a file F can contain many fields, but there may be one particular field whose values uniquely determine the records in the file. Such a field K is called a primary key, and the values k1, k2, ... in such a field are called keys or key values. Sorting the file F usually refers to sorting F with respect to a particular primary key, and searching in F refers to searching for the record with a given key value.
This chapter will first investigate sorting algorithms and then investigate searching algorithms. Some texts treat searching before sorting.

SORTING
1. Let A be a list of n elements A1, A2, . . . , An in memory.
2. Sorting A refers to the operation of rearranging the contents of A so that they are
in increasing order (numerically or lexicographically), that is, so that
3. A1 ≤ A2 ≤ A3 ≤ ... ≤ An
4. Since A has n elements, there are n! ways that the contents can appear in A.
5. These ways correspond precisely to the n! permutations of 1, 2, . . . , n. Accordingly, each sorting algorithm must
take care of these n! possibilities.

EXAMPLE
1. Suppose an array DATA contains 8 elements as follows:
2. DATA: 77, 33, 44, 11, 88, 22, 66, 55
3. After sorting, DATA must appear in memory as follows:
4. DATA: 11, 22, 33, 44, 55, 66, 77, 88
5. Since DATA consists of 8 elements, there are 8! = 40,320 ways that the
numbers 11, 22, ..., 88 can appear in DATA.

Complexity of Sorting Algorithms


1. The complexity of a sorting algorithm measures the running time as a function of the
number n of items to be sorted.
2. We note that each sorting algorithm S will be made up of the following operations,
where A1, A2, ..., An contain the items to be sorted and B is an auxiliary location:
(a) Comparisons, which test whether Ai < Aj or test whether Ai < B
(b) Interchanges, which switch the contents of Ai and Aj or of Ai and B
(c) Assignments, which set B := Ai and then set Aj := B or Aj := Ai

3. Normally, the complexity function measures only the number of comparisons, since
the number of other operations is at most a constant factor of the number of
comparisons.
4. There are two main cases whose complexity we will consider; the worst case and the
average case. In studying the average case, we make the probabilistic assumption that
all the n! permutations of the given n items are equally likely. (The reader is referred
to Sec. 2.5 for a more detailed discussion of complexity.)
5. Previously, we have studied the bubble sort (Sec. 4.6), quick sort (Sec. 6.5) and heap
sort (Sec. 7.10). The approximate number of comparisons and the order of complexity
of these algorithms are summarized in the following table:

6. Note first that the bubble sort is a very slow way of sorting; its main
advantage is the simplicity of the algorithm.
7. Observe that the average-case complexity O(n log n) of heap sort is the same
as that of quick sort, but its worst-case complexity O(n log n) seems quicker
than that of quick sort, O(n^2).
8. However, empirical evidence seems to indicate that quick sort is superior to
heap sort except on rare occasions.

Lower Bounds
1. The reader may ask whether there is an algorithm which can sort n items in time
of order less than O(n log n).
2. The answer is no. The reason is indicated below. Suppose S is an algorithm
which sorts n items a1, a2, . . . , an.
3. We assume there is a decision tree T corresponding to the algorithm S such that
T is an extended binary search tree where the external nodes correspond to the
n! ways that n items can appear in memory and where the internal nodes
correspond to the different comparisons that may take place during the
execution of the algorithm S.
4. Then the number of comparisons in the worst case for the algorithm S is equal to
the length of the longest path in the decision tree T or, in other words, the depth
D of the tree, T.
5. Moreover, the average number of comparisons for the algorithm S is equal to the
average external path length E of the tree T.
6. Figure 9-1 shows a decision tree T for sorting n = 3 items. Observe that T has n!
= 3! = 6 external nodes. The values of D and E for the tree follow:
7. D = 3 and E = (1/6)(2 + 3 + 3 + 3 + 3 + 2) = 2.667
8. Consequently, the corresponding algorithm S requires at most (worst case) D = 3
comparisons and, on the average, E = 2.667 comparisons to sort the n = 3 items.
9. Accordingly, studying the worst- case and average-case complexity of a sorting
algorithm S is reduced to studying the values of D and E in the corresponding
decision tree T.
10. First, however, we recall some facts about extended binary trees (Sec. 7.11).
Suppose T is an extended binary tree with N external nodes, depth D and
external path length E(T). Any such tree cannot have more than 2^D external
nodes, and so
11. 2^D ≥ N or, equivalently, D ≥ log N
12. Furthermore, T will have a minimum external path length E(L) among all such
trees with N external nodes when T is a complete tree. In such a case,
13. E(L) = N log N + O(N) ≥ N log N
14. The N log N comes from the fact that there are N paths with length log N or log
N + 1, and the O(N) comes from the fact that there are at most N nodes on the
deepest level.
15. Dividing E(L) by the number N of external paths gives the average external path
length E. Thus, for any extended binary tree T with N external nodes,
E ≥ (N log N)/N = log N

16. Now suppose T is the decision tree corresponding to a sorting algorithm S which
sorts n items. Then T has n! external nodes.
17. Substituting n! for N in the above formulas yields
18. D ≥ log n! ≈ n log n and E ≥ log n! ≈ n log n
19. The approximation log n! ≈ n log n comes from Stirling's formula, that is,
n! ≈ sqrt(2πn) (n/e)^n

20. Thus n log n is a lower bound for both the worst case and the average case.
21. In other words, O(n log n) is the best possible for any sorting algorithm which
sorts n items.
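
The bound D ≥ log n! can be checked numerically. The following is only a small sketch, assuming base-2 logarithms (one comparison has two outcomes); the function name min_worst_case_comparisons is introduced here for illustration and does not come from the text.

```python
import math

def min_worst_case_comparisons(n):
    """Smallest depth D with 2**D >= n!, that is, ceil(log2(n!))."""
    return math.ceil(math.log2(math.factorial(n)))

# Compare the decision-tree bound with n log n for a few sizes.
for n in (3, 8, 16):
    bound = min_worst_case_comparisons(n)
    print(n, bound, round(n * math.log2(n), 1))
```

For n = 3 the function returns 3, which agrees with the depth D = 3 of the decision tree in Fig. 9-1.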
Fig. 9-1 Decision tree T for sorting n = 3 items.

Sorting Files; Sorting Pointers


1. Suppose a file F of records R1, R2, ..., Rn is stored in memory.
2. "Sorting F" refers to sorting F with respect to some field K with corresponding
values k1, k2, ..., kn.
3. That is, the records are ordered so that
4. k1 ≤ k2 ≤ ... ≤ kn
5. The field K is called the sort key.
6. (Recall that K is called a primary key if its values uniquely determine the records
in F.)
7. Sorting the file with respect to another key will order the records in another way.

EXAMPLE 9.2
1. Suppose the personnel file of a company contains the following data on each of
its employees:
2. Name, Social Security Number, Sex, Monthly Salary
3. Sorting the file with respect to the Name key will yield a different order of the
records than sorting the file with respect to the Social Security Number key.
4. The company may want to sort the file according to the Salary field even though
the field may not uniquely determine the employees.
5. Sorting the file with respect to the Sex key will likely be useless; it simply
separates the employees into two subfiles, one with the male employees and
one with the female employees.
6. Sorting a file F by reordering the records in memory may be very expensive when
the records are very long.
7. Moreover, the records may be in secondary memory, where it is even more time-
consuming to move records into different locations.
8. Accordingly, one may prefer to form an auxiliary array POINT containing pointers
to the records in memory and then sort the array POINT with respect to a field
KEY rather than sorting the records themselves.
9. That is, we sort POINT so that
10. KEY[POINT[1]] ≤ KEY[POINT[2]] ≤ . .. ≤ KEY[POINT[N]]
11. Note that choosing a different field KEY will yield a different order of the array
POINT.
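
The idea of sorting pointers rather than records can be sketched as follows. This is only an illustration: the three records and their field values are invented, the helper name sorted_pointers is an assumption, and Python lists are indexed from 0 whereas the text indexes from 1.

```python
# Hypothetical records; the field values are made up for illustration.
records = [
    {"name": "Kelly", "salary": 2200},
    {"name": "Adams", "salary": 3100},
    {"name": "Evans", "salary": 1800},
]

def sorted_pointers(records, key):
    """Return record indices ordered by the given field; the records do not move."""
    return sorted(range(len(records)), key=lambda i: records[i][key])

ptr_name = sorted_pointers(records, "name")      # [1, 2, 0]
ptr_salary = sorted_pointers(records, "salary")  # [2, 0, 1]
print(ptr_name, ptr_salary)
```

Choosing a different key yields a different pointer array, while the records themselves stay where they are.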
EXAMPLE 9.3

1. Figure 9-2(a) shows a personnel file of a company in memory. Figure 9-2(b)
shows three arrays, POINT, PTRNAME and PTRSSN.
2. The array POINT contains the locations of the records in memory. PTRNAME
shows the pointers sorted according to the NAME field, that is,
3. NAME[PTRNAME[1]] < NAME[PTRNAME[2]] < ... < NAME[PTRNAME[9]]
4. and PTRSSN shows the pointers sorted according to the SSN field, that is,
5. SSN[PTRSSN[1]] < SSN[PTRSSN[2]] < ... < SSN[PTRSSN[9]]

Fig. 9-2

6. Given the name EMP of an employee, one can easily find the location of the
employee's record in memory using the array PTRNAME and the binary search algorithm.
7. Similarly, given the social security number NUMB of an employee, one can easily
find the location of the employee’s record in memory by using the array PTRSSN
and the binary search algorithm.
8. Observe, also, that it is not even necessary for the records to appear in successive
memory locations. Thus inserting and deleting records can easily be done.
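
A binary search through a sorted pointer array might look like the sketch below. The records, the field names and the function find_record are hypothetical, and indexing starts at 0 rather than 1.

```python
# Hypothetical records; names and numbers are invented for illustration.
records = [
    {"name": "Kelly", "ssn": 305},
    {"name": "Adams", "ssn": 112},
    {"name": "Evans", "ssn": 207},
]
# Pointer array sorted on the NAME field (plays the role of PTRNAME).
ptr_name = sorted(range(len(records)), key=lambda i: records[i]["name"])

def find_record(records, pointers, field, target):
    """Binary search over a pointer array sorted on `field`.

    Returns the index of the matching record, or None if there is none.
    """
    low, high = 0, len(pointers) - 1
    while low <= high:
        mid = (low + high) // 2
        value = records[pointers[mid]][field]
        if value == target:
            return pointers[mid]
        if value < target:
            low = mid + 1
        else:
            high = mid - 1
    return None

print(find_record(records, ptr_name, "name", "Evans"))  # 2
```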

INSERTION SORT
1. Suppose an array A with n elements A[1], A[2], ..., A[N] is in memory.
2. The insertion sort algorithm scans A from A[1] to A[N], inserting each element
A[K] into its proper position in the previously sorted subarray A[1],
A[2], ..., A[K-1]. That is:
3. Pass 1. A[1] by itself is trivially sorted.
4. Pass 2. A[2] is inserted either before or after A[1] so that: A[1], A[2] is sorted.
5. Pass 3. A[3] is inserted into its proper place in A[1], A[2], that is, before A[1],
between A[1] and A[2], or after A[2], so that: A[1], A[2], A[3] is sorted.
6. Pass 4. A[4] is inserted into its proper place in A[1], A[2], A[3] so that:
7. A[1], A[2], A[3], A[4] is sorted.
8. ...
9. Pass N. A[N] is inserted into its proper place in A[1], A[2], ..., A[N-1] so that: A[1],
A[2], ..., A[N] is sorted.
10. This sorting algorithm is frequently used when n is small. For example, this
algorithm is very popular with bridge players when they are first sorting their
cards.
11. There remains only the problem of deciding how to insert A[K] in its proper place
in the sorted subarray A[1], A[2], ..., A[K-1].
12. This can be accomplished by comparing A[K] with A[K-1], comparing A[K] with
A[K-2], comparing A[K] with A[K-3], and so on, until first meeting an element
A[J] such that A[J] ≤ A[K]. Then each of the elements A[K-1], A[K-2], ..., A[J+1] is
moved forward one location, and A[K] is then inserted in the (J+1)st position in the
array.
13. The algorithm is simplified if there always is an element A[J] such that A[J] ≤ A[K];
otherwise we must constantly check to see if we are comparing A[K] with A[1].
14. This condition can be accomplished by introducing a sentinel element A[0] = -∞
(or a very small number).

Fig. 9-3 Insertion sort for n = 8 items


EXAMPLE 9.4

1. Suppose an array A contains 8 elements as follows:
2. 77, 33, 44, 11, 88, 22, 66, 55
3. Figure 9-3 illustrates the insertion sort algorithm. The circled element indicates the
A[K] in each pass of the algorithm, and the arrow indicates the proper place for
inserting A[K].
4. The formal statement of our insertion sort algorithm follows.
(Insertion Sort) INSERTION(A, N).
This algorithm sorts the array A with N elements.
1. Set A[0] := -∞. [Initializes sentinel element.]
2. Repeat Steps 3 to 5 for K = 2, 3, ..., N:
3.    Set TEMP := A[K] and PTR := K - 1.
4.    Repeat while TEMP < A[PTR]:
         (a) Set A[PTR + 1] := A[PTR]. [Moves element forward.]
         (b) Set PTR := PTR - 1.
      [End of loop.]
5.    Set A[PTR + 1] := TEMP. [Inserts element in proper place.]
   [End of Step 2 loop.]
6. Return.
Observe that there is an inner loop which is essentially controlled by the variable PTR,
and there is an outer loop which uses K as an index.
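
The algorithm above can be sketched in Python as follows. This is an illustration rather than a literal transcription: Python lists are indexed from 0, and the sketch replaces the sentinel A[0] = -∞ with an explicit bounds check (ptr >= 0).

```python
def insertion_sort(a):
    """Sort the list a in place, following the steps of INSERTION(A, N)."""
    for k in range(1, len(a)):
        temp = a[k]
        ptr = k - 1
        # The bounds check ptr >= 0 stands in for the sentinel element.
        while ptr >= 0 and temp < a[ptr]:
            a[ptr + 1] = a[ptr]   # move element forward
            ptr -= 1
        a[ptr + 1] = temp         # insert element in proper place
    return a

data = [77, 33, 44, 11, 88, 22, 66, 55]
print(insertion_sort(data))  # [11, 22, 33, 44, 55, 66, 77, 88]
```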

Complexity of Insertion Sort


The number f(n) of comparisons in the insertion sort algorithm can be easily computed. First of
all, the worst case occurs when the array A is in reverse order and the inner loop must use the
maximum number K - 1 of comparisons. Hence
f(n) = 1 + 2 + ... + (n - 1) = n(n - 1)/2 = O(n^2)
Furthermore, one can show that, on the average, there will be approximately (K - 1)/2
comparisons in the inner loop. Accordingly, for the average case,
f(n) = 1/2 + 2/2 + ... + (n - 1)/2 = n(n - 1)/4 = O(n^2)
Thus the insertion sort algorithm is a very slow algorithm when n is very large. The above results
are summarized in the following table:

Algorithm        Worst Case              Average Case
Insertion Sort   n(n - 1)/2 = O(n^2)     n(n - 1)/4 = O(n^2)

Remark: Time may be saved by performing a binary search, rather than a linear search, to find
the location in which to insert A[K] in the subarray A[1], A[2], ..., A[K - 1]. This
requires, on the average, log K comparisons rather than (K - 1)/2 comparisons.
However, one still needs to move (K - 1)/2 elements forward. Thus the order of
complexity is not changed. Furthermore, insertion sort is usually used only when n is
small, and in such a case, the linear search is about as efficient as the binary search.
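
The variant described in the remark might be sketched as below, using the standard-library bisect module for the binary search. The function name binary_insertion_sort is an assumption made for this example. Note that the slice assignment still moves elements one by one, which is why the order of complexity does not change.

```python
import bisect

def binary_insertion_sort(a):
    """Insertion sort that finds each insertion point by binary search."""
    for k in range(1, len(a)):
        temp = a[k]
        # Binary search in the already sorted prefix a[0:k].
        pos = bisect.bisect_right(a, temp, 0, k)
        # Shift a[pos:k] one place forward; this is still O(k) moves.
        a[pos + 1:k + 1] = a[pos:k]
        a[pos] = temp
    return a

print(binary_insertion_sort([77, 33, 44, 11, 88, 22, 66, 55]))
# [11, 22, 33, 44, 55, 66, 77, 88]
```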

SELECTION SORT
1. Suppose an array A with n elements A[1], A[2], . . . , A[N] is in memory.
2. The selection sort algorithm for sorting A works as follows.
3. First find the smallest element in the list and put it in the first position.
4. Then find the second smallest element in the list and put it in the second position.
5. And so on. More precisely:
Pass 1. Find the location LOC of the smallest in the list of N elements
A[1], A[2], ..., A[N], and then interchange A[LOC] and A[1]. Then:
A[1] is sorted.
Pass 2. Find the location LOC of the smallest in the sublist of N - 1 elements
A[2], A[3], ..., A[N], and then interchange A[LOC] and A[2].
Then: A[1], A[2] is sorted, since A[1] ≤ A[2].
Pass 3. Find the location LOC of the smallest in the sublist of N - 2
elements A[3], A[4], ..., A[N], and then interchange A[LOC] and
A[3]. Then:
A[1], A[2], A[3] is sorted, since A[2] ≤ A[3].
...
Pass N - 1. Find the location LOC of the smaller of the elements A[N - 1],
A[N], and then interchange A[LOC] and A[N - 1]. Then:
A[1], A[2], ..., A[N] is sorted, since A[N - 1] ≤ A[N].
Thus A is sorted after N - 1 passes.

Example 9.5
1. Suppose an array A contains 8 elements as follows:
77, 33, 44, 11, 88, 22, 66, 55
2. Applying the selection sort algorithm to A yields the data in Fig. 9-4.
3. Observe that LOC gives the location of the smallest among A[K], A[K + 1], ..., A[N]
during Pass K.
4. The circled elements indicate the elements which are to be interchanged.

Fig. 9-4 Selection sort for n = 8 items.

5. There remains only the problem of finding, during the Kth pass, the location LOC
of the smallest among the elements A[K], A[K + 1], ..., A[N].
6. This may be accomplished by using a variable MIN to hold the current smallest
value while scanning the subarray from A[K] to A[N].
7. Specifically, first set MIN := A[K] and LOC := K, and then traverse the list,
comparing MIN with each other element A[J] as follows:
(a) If MIN ≤ A[J], then simply move to the next element.
(b) If MIN > A[J], then update MIN and LOC by setting MIN := A[J] and LOC := J.
8. After comparing MIN with the last element A[N], MIN will contain the smallest
among the elements A[K], A[K + 1], ..., A[N] and LOC will contain its location.
9. The above process will be stated separately as a procedure.

Procedure 9.2: MIN(A, K, N, LOC)
An array A is in memory. This procedure finds the location LOC of the
smallest element among A[K], A[K + 1], ..., A[N].
1. Set MIN := A[K] and LOC := K. [Initializes pointers.]
2. Repeat for J = K + 1, K + 2, ..., N:
      If MIN > A[J], then: Set MIN := A[J] and LOC := J.
   [End of loop.]
3. Return.
The selection sort algorithm can now be easily stated:
Algorithm 9.3: (Selection Sort) SELECTION(A, N)
This algorithm sorts the array A with N elements.
1. Repeat Steps 2 and 3 for K = 1, 2, ..., N - 1:
2.    Call MIN(A, K, N, LOC).
3.    [Interchange A[K] and A[LOC].]
      Set TEMP := A[K], A[K] := A[LOC] and A[LOC] := TEMP.
   [End of Step 1 loop.]
4. Exit.
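
Procedure 9.2 and Algorithm 9.3 can be sketched together in Python. The names min_location and selection_sort are chosen for this example, and the indices run from 0 rather than 1; otherwise the steps follow the pseudocode above.

```python
def min_location(a, k):
    """Return the index of the smallest element among a[k], a[k+1], ..., a[-1]."""
    loc = k
    for j in range(k + 1, len(a)):
        if a[j] < a[loc]:
            loc = j
    return loc

def selection_sort(a):
    """Sort the list a in place using len(a) - 1 passes."""
    for k in range(len(a) - 1):
        loc = min_location(a, k)
        a[k], a[loc] = a[loc], a[k]   # interchange A[K] and A[LOC]
    return a

print(selection_sort([77, 33, 44, 11, 88, 22, 66, 55]))
# [11, 22, 33, 44, 55, 66, 77, 88]
```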
Complexity of the Selection Sort Algorithm

1. First note that the number f(n) of comparisons in the selection sort algorithm is
independent of the original order of the elements. Observe that MIN(A, K, N, LOC)
requires n - K comparisons.
2. That is, there are n - 1 comparisons during Pass 1 to find the smallest element,
there are n - 2 comparisons during Pass 2 to find the second smallest element,
and so on. Accordingly,
3. f(n) = (n - 1) + (n - 2) + ... + 2 + 1 = n(n - 1)/2 = O(n^2)
4. The above result is summarized in the following table:

Algorithm        Worst Case              Average Case
Selection Sort   n(n - 1)/2 = O(n^2)     n(n - 1)/2 = O(n^2)

Remark: The number of interchanges and assignments does depend on the original
order of the elements in the array A, but the sum of these operations does not exceed a
factor of n^2.

MERGING
1. Suppose A is a sorted list with r elements and B is a sorted list with s elements.
2. The operation that combines the elements of A and B into a single sorted list C
with n = r + s elements is called merging.
3. One simple way to merge is to place the elements of B after the elements of A
and then use some sorting algorithm on the entire list.
4. This method does not take advantage of the fact that A and B are individually
sorted. A much more efficient algorithm is Algorithm 9.4 in this section.
5. First, however, we indicate the general idea of the algorithm by means of two
examples.
6. Suppose one is given two sorted decks of cards.
7. The decks are merged as in Fig. 9-5.
8. That is, at each step, the two front cards are compared and the smaller one is
placed in the combined deck.
9. When one of the decks is empty, all of the remaining cards in the other deck are
put at the end of the combined deck.
10. Similarly, suppose we have two lines of students sorted by increasing heights,
and suppose we want to merge them into a single sorted line.
11. The new line is formed by choosing, at each step, the shorter of the two students
who are at the head of their respective lines.
12. When one of the lines has no more students, the remaining students line up at
the end of the combined line.

13. The above discussion will now be translated into a formal algorithm which
merges a sorted r – element array A and a sorted s – element array B into a
sorted array C, with n = r + s elements.
14. First of all, we must always keep track of the locations of the smallest element of
A and the smallest element of B which have not yet been placed in C.
15. Let NA and NB denote these locations, respectively. Also, let PTR denote the
location in C to be filled. Thus, initially, we set NA := 1, NB := 1 and PTR := 1. At each
step of the algorithm, we compare
16. A[NA] and B[NB]
17. and assign the smaller element to C[PTR]. Then we increment PTR by setting
PTR := PTR + 1, and we either increment NA by setting NA := NA + 1 or increment
NB by setting NB := NB + 1, according to whether the new element in C has come
from A or from B.
18. Furthermore, if NA > r, then the remaining elements of B are assigned to C; or if
NB > s, then the remaining elements of A are assigned to C.
19. The formal statement of the algorithm follows.

Algorithm 9.4: MERGING(A, R, B, S, C)
Let A and B be sorted arrays with R and S elements, respectively. This
algorithm merges A and B into an array C with N = R + S elements.
1. [Initialize.] Set NA := 1, NB := 1 and PTR := 1.
2. [Compare.] Repeat while NA ≤ R and NB ≤ S:
      If A[NA] < B[NB], then:
         (a) [Assign element from A to C.] Set C[PTR] := A[NA].
         (b) [Update pointers.] Set PTR := PTR + 1 and NA := NA + 1.
      Else:
         (a) [Assign element from B to C.] Set C[PTR] := B[NB].
         (b) [Update pointers.] Set PTR := PTR + 1 and NB := NB + 1.
      [End of If structure.]
   [End of loop.]
3. [Assign remaining elements to C.]
   If NA > R, then:
      Repeat for K = 0, 1, 2, ..., S - NB:
         Set C[PTR + K] := B[NB + K].
      [End of loop.]
   Else:
      Repeat for K = 0, 1, 2, ..., R - NA:
         Set C[PTR + K] := A[NA + K].
      [End of loop.]
   [End of If structure.]
4. Exit.
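
A Python sketch of the merging algorithm follows. It keeps the pointer names NA and NB from the text (as na and nb), but it builds C by appending to a list rather than filling C[PTR], so PTR is implicit; indices start at 0.

```python
def merge(a, b):
    """Merge two sorted lists a and b into a new sorted list c."""
    c = []
    na = nb = 0
    # Compare the front elements while both lists still have elements.
    while na < len(a) and nb < len(b):
        if a[na] < b[nb]:
            c.append(a[na])
            na += 1
        else:
            c.append(b[nb])
            nb += 1
    # Exactly one of these slices is non-empty (or both are empty).
    c.extend(a[na:])
    c.extend(b[nb:])
    return c

print(merge([11, 33, 77], [22, 44, 55, 88]))  # [11, 22, 33, 44, 55, 77, 88]
```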

Complexity of the Merging Algorithm


The input consists of the total number n = r + s of elements in A and B. Each comparison
assigns an element to the array C, which eventually has n elements. Accordingly, the
number f(n) of comparisons cannot exceed n:
f(n) ≤ n = O(n)
In other words, the merging algorithm can be run in linear time.
Nonregular Matrices

1. Suppose A, B and C are matrices, but not necessarily regular matrices.
2. Assume A is sorted, with r elements and lower bound LBA; B is sorted, with s
elements and lower bound LBB; and C has lower bound LBC. Then
UBA = LBA + r - 1 and UBB = LBB + s - 1 are, respectively, the upper bounds of A and B.
3. Merging A and B now may be accomplished by modifying the above algorithm as follows.

Procedure 9.5: MERGE(A, R, LBA, B, S, LBB, C, LBC)
This procedure merges the sorted arrays A and B into the array C.
1. Set NA := LBA, NB := LBB, PTR := LBC, UBA := LBA + R - 1 and UBB := LBB + S - 1.
2. Same as Step 2 of Algorithm 9.4 except R is replaced by UBA and S by UBB.
3. Same as Step 3 of Algorithm 9.4 except R is replaced by UBA and S by UBB.
4. Return.

4. Observe that this procedure is called MERGE, whereas Algorithm 9.4 is called MERGING.
5. The reason for stating this special case is that this procedure will be used in the
next section, on merge-sort.

Binary Search and Insertion Algorithm


1. Suppose the number r of elements in a sorted array A is much smaller than the
number s of elements in a sorted array B.
2. One can merge A with B as follows. For each element A[K] of A, use a binary
search on B to find the proper location to insert A[K] into B.
3. Each such search requires at most log s comparisons; hence this binary search
and insertion algorithm to merge A and B requires at most r log s comparisons.
4. We emphasize that this algorithm is more efficient than the usual merging
Algorithm 9.4 only when r << s, that is, when r is much less than s.
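
A minimal sketch of the binary search and insertion idea, assuming Python's standard bisect module performs the binary search; the function name merge_by_insertion is introduced here for illustration. Only the number of comparisons is reduced: each insertion into a list still moves elements.

```python
import bisect

def merge_by_insertion(a, b):
    """Merge a short sorted list a into a copy of the long sorted list b."""
    c = list(b)
    for item in a:
        # insort locates the position by binary search, then inserts.
        bisect.insort(c, item)
    return c

print(merge_by_insertion([5, 20], [1, 7, 9, 30]))  # [1, 5, 7, 9, 20, 30]
```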

EXAMPLE 9.6

1. Suppose A has 5 elements and suppose B has 100 elements.
2. Then merging A and B by Algorithm 9.4 uses approximately 100 comparisons. On
the other hand, only approximately log 100 ≈ 7 comparisons are needed to find
the proper place to insert an element of A into B using a binary search.
3. Hence only approximately 5 * 7 = 35 comparisons are needed to merge A and B
using the binary search and insertion algorithm.
4. The binary search and insertion algorithm does not take into account the fact
that A is sorted.
5. Accordingly, the algorithm may be improved in two ways as follows. (Here we
assume that A has 5 elements and B has 100 elements.)
(1) Reducing the target set. Suppose after the first search we find that A[1] is to
be inserted after B[16]. Then we need only use a binary search on
B[17], ..., B[100] to find the proper location to insert A[2]. And so on.
(2) Tabbing. The expected location for inserting A[1] in B is near B[20] (that is,
B[s/r]), not near B[50]. Hence we first use a linear search on B[20], B[40],
B[60], B[80] and B[100] to find B[K] such that A[1] ≤ B[K], and then we use a
binary search on B[K - 20], B[K - 19], ..., B[K]. (This is analogous to using
the tabs in a dictionary which indicate the location of all words with the same
first letter.)
The details of the revised algorithm are left to the reader.
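
One possible way to realize improvement (1), reducing the target set, is sketched below. It is only an assumption about how the details might be filled in, not the text's own algorithm: the function name is hypothetical, and the lower limit of each binary search is passed through the lo argument of bisect.bisect_right.

```python
import bisect

def merge_with_shrinking_target(a, b):
    """Insert each element of sorted a into a copy of sorted b,
    searching only the part of the list after the previous insertion."""
    c = list(b)
    lo = 0  # all elements before index lo are <= the next element of a
    for item in a:
        lo = bisect.bisect_right(c, item, lo)  # binary search on c[lo:]
        c.insert(lo, item)
        lo += 1  # the next search starts just after the inserted item
    return c

print(merge_with_shrinking_target([5, 20], [1, 7, 9, 30]))
# [1, 5, 7, 9, 20, 30]
```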

MERGE-SORT
Suppose an array A with n elements A[1], A[2], ..., A[N] is in memory. The
merge-sort algorithm which sorts A will first be described by means of a specific
example.

EXAMPLE 9.7

1. Suppose the array A contains 14 elements as follows:
2. 66, 33, 40, 22, 55, 88, 60, 11, 80, 20, 50, 44, 77, 30
3. Each pass of the merge-sort algorithm will start at the beginning of the array A and
merge pairs of sorted subarrays as follows:
Pass 1. Merge each pair of elements to obtain the following list of sorted pairs:
33, 66    22, 40    55, 88    11, 60    20, 80    44, 50    30, 77
Pass 2. Merge each pair of pairs to obtain the following list of sorted quadruplets:
22, 33, 40, 66    11, 55, 60, 88    20, 44, 50, 80    30, 77
Pass 3. Merge each pair of sorted quadruplets to obtain the following two
sorted subarrays:
11, 22, 33, 40, 55, 60, 66, 88    20, 30, 44, 50, 77, 80
Pass 4. Merge the two sorted subarrays to obtain the single sorted array
11, 20, 22, 30, 33, 40, 44, 50, 55, 60, 66, 77, 80, 88
The original array A is now sorted.

4. The above merge-sort algorithm for sorting an array A has the following
important property.
5. After Pass K, the array A will be partitioned into sorted subarrays where each
subarray, except possibly the last, will contain exactly L = 2^K elements.
6. Hence the algorithm requires at most log n passes to sort an n-element array A.
7. The above informal description of merge-sort will now be translated into a formal
algorithm which will be divided into two parts.
8. The first part will be a procedure MERGEPASS, which uses Procedure 9.5 to
execute a single pass of the algorithm; and the second part will repeatedly apply
MERGEPASS until A is sorted.
9. The MERGEPASS procedure applies to an n-element array A which consists of a
sequence of sorted subarrays.
10. Moreover, each subarray consists of L elements except that the last subarray
may have fewer than L elements.
11. Dividing n by 2 * L, we obtain the quotient Q, which tells the number of pairs of
L-element sorted subarrays; that is,
Q = INT(N/(2*L))
12. (We use INT(X) to denote the integer value of X.) Setting S = 2*L*Q, we get the total
number S of elements in the Q pairs of subarrays.
13. Hence R = N - S denotes the number of remaining elements.
14. The procedure first merges the initial Q pairs of L-element subarrays.
15. Then the procedure takes care of the case where there is an odd number of
subarrays (when R ≤ L) or where the last subarray has fewer than L elements.
16. The formal statement of MERGEPASS and the merge-sort algorithm follow:
Procedure 9.6: MERGEPASS(A, N, L, B)
The N-element array A is composed of sorted subarrays where each
subarray has L elements except possibly the last subarray, which may
have fewer than L elements. The procedure merges the pairs of
subarrays of A and assigns them to the array B.
1. Set Q := INT(N/(2*L)), S := 2*L*Q and R := N - S.
2. [Use Procedure 9.5 to merge the Q pairs of subarrays.]
   Repeat for J = 1, 2, ..., Q:
      (a) Set LB := 1 + (2*J - 2)*L. [Finds lower bound of first array.]
      (b) Call MERGE(A, L, LB, A, L, LB + L, B, LB).
   [End of loop.]
3. [Only one subarray left?]
   If R ≤ L, then:
      Repeat for J = 1, 2, ..., R:
         Set B[S + J] := A[S + J].
      [End of loop.]
   Else:
      Call MERGE(A, L, S + 1, A, R - L, L + S + 1, B, S + 1).
   [End of If structure.]
4. Return.

Algorithm 9.7: MERGESORT(A, N)
This algorithm sorts the N-element array A using an auxiliary array B.
1. Set L := 1. [Initializes the number of elements in the subarrays.]
2. Repeat Steps 3 to 5 while L < N:
3.    Call MERGEPASS(A, N, L, B).
4.    Call MERGEPASS(B, N, 2*L, A).
5.    Set L := 4*L.
   [End of Step 2 loop.]
6. Exit.

Since we want the sorted array to finally appear in the original array A, we must execute
the procedure MERGEPASS an even number of times.
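
The two-part structure (a single pass, then repeated passes in pairs) can be sketched in Python as follows. This is an illustration under stated assumptions: indices start at 0, the merging of each pair is written inline rather than through a separate MERGE procedure, and the names merge_pass and merge_sort are chosen for this example.

```python
def merge_pass(src, dst, width):
    """Merge adjacent sorted runs of length `width` from src into dst."""
    n = len(src)
    for lb in range(0, n, 2 * width):
        mid = min(lb + width, n)       # end of first run, start of second
        ub = min(lb + 2 * width, n)    # end of second run
        i, j, k = lb, mid, lb
        while i < mid and j < ub:
            if src[i] <= src[j]:
                dst[k] = src[i]
                i += 1
            else:
                dst[k] = src[j]
                j += 1
            k += 1
        # Copy whatever remains of the run that is not yet exhausted.
        dst[k:ub] = src[i:mid] if i < mid else src[j:ub]

def merge_sort(a):
    """Sort a in place; passes come in pairs so the result ends up in a."""
    b = [None] * len(a)
    width = 1
    while width < len(a):
        merge_pass(a, b, width)
        merge_pass(b, a, 2 * width)
        width *= 4
    return a

data = [66, 33, 40, 22, 55, 88, 60, 11, 80, 20, 50, 44, 77, 30]
print(merge_sort(data))
# [11, 20, 22, 30, 33, 40, 44, 50, 55, 60, 66, 77, 80, 88]
```

When the run length already covers the whole array, the second pass of a pair simply copies the data back, which keeps the number of passes even.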

Complexity of the Merge-Sort Algorithm

Let f(n) denote the number of comparisons needed to sort an n-element array A using
the merge-sort algorithm. Recall that the algorithm requires at most log n passes. Moreover, each pass merges
a total of n elements, and by the discussion on the complexity of merging, each pass will require at most n
comparisons. Accordingly, for both the worst case and average case,
f(n) ≤ n log n
Observe that this algorithm has the same order as heap sort and the same average order as quick sort. The
main drawback of merge-sort is that it requires an auxiliary array with n elements. Each of the other sorting
algorithms we have studied requires only a finite number of extra locations, which is independent of n.
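The above results are summarized in the following table:

Algorithm     Worst Case                  Average Case                Extra Memory
Merge-Sort    n log n = O(n log n)        n log n = O(n log n)        O(n)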

RADIX SORT
1. Radix sort is the method that many people intuitively use or begin to use when
alphabetizing a large list of names.
2. (Here the radix is 26, the 26 letters of the alphabet.)
3. Specifically, the list of names is first sorted according to the first letter of each
name.
4. That is, the names are arranged in 26 classes, where the first class consists of
those names that begin with “A,” the second class consists of those names that
begin with “B,” and so on.
5. During the second pass, each class is alphabetized according to the second letter
of the name. And so on. If no name contains, for example, more than 12 letters,
the names are alphabetized with at most 12 passes.
6. The radix sort is the method used by a card sorter. A card sorter contains 13
receiving pockets labeled as follows:
7. 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 11, 12, R (reject)
8. Each pocket other than R corresponds to a row on a card in which a hole can be
punched. Decimal numbers, where the radix is 10, are punched in the obvious
way and hence use only the first 10 pockets of the sorter.
9. The sorter uses a radix reverse-digit sort on numbers.
10. That is, suppose a card sorter is given a collection of cards where each card
contains a 3-digit number punched in columns 1 to 3.
11. The cards are first sorted according to the units digit. On the second pass, the
cards are sorted according to the tens digit.
12. On the third and last pass, the cards are sorted according to the hundreds digit.
We illustrate with an example.
EXAMPLE 9.8
Suppose 9 cards are punched as follows:
348, 143, 361, 423, 538, 128, 321, 543, 366
Given to a card sorter, the numbers would be sorted in three phases, as pictured in Fig.9-6:
(a) In the first pass, the units digits are sorted into pockets. (The pockets are pictured upside
down, so 348 is at the bottom of pocket 8.) The cards are collected pocket by pocket, from
pocket 9 to pocket 0. (Note that 361 will now be at the bottom of the pile and 128 at the
top of the pile.) The cards are now reinput to the sorter.
(b) In the second pass, the tens digits are sorted into pockets. Again the cards are collected
pocket by pocket and reinput to the sorter.
(c) In the third and final pass, the hundreds digits are sorted into pockets.
When the cards are collected after the third pass, the numbers are in the following
order:
128, 143, 321, 348, 361, 366, 423, 538, 543
Thus the cards are now sorted.
The number C of comparisons needed to sort nine such 3-digit numbers is bounded as
follows:
C ≤ 9 * 3 * 10
The 9 comes from the nine cards, the 3 comes from the three digits in each number,
and the 10 comes from radix d = 10 digits.
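
The reverse-digit passes of the card sorter can be imitated with lists as pockets. The sketch below is an assumption about one way to code it: the name radix_sort and its parameters are introduced here, pockets are collected from 0 upward, and each pass keeps the earlier order within a pocket.

```python
def radix_sort(numbers, digits, radix=10):
    """Sort non-negative integers with at most `digits` digits, units digit first."""
    items = list(numbers)
    for k in range(digits):
        pockets = [[] for _ in range(radix)]
        for x in items:
            # The k-th digit (k = 0 is the units digit) selects the pocket.
            pockets[(x // radix**k) % radix].append(x)
        # Collect the cards pocket by pocket, keeping their order in each pocket.
        items = [x for pocket in pockets for x in pocket]
    return items

cards = [348, 143, 361, 423, 538, 128, 321, 543, 366]
print(radix_sort(cards, 3))
# [128, 143, 321, 348, 361, 366, 423, 538, 543]
```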

Complexity of Radix Sort


Suppose a list A of n items A1, A2, ..., An is given. Let d denote the radix (e.g., d = 10 for decimal digits, d = 26 for letters and d = 2
for bits), and suppose each item Ai is represented by means of s of the digits:
Ai = di1 di2 ... dis
The radix sort algorithm will require s passes, the number of digits in each item. Pass K will compare each diK with each of the d
digits. Hence the number C(n) of comparisons for the algorithm is bounded as follows:
C(n) ≤ d*s*n
Although d is independent of n, the number s does depend on n. In the worst case, s = n, so C(n) = O(n^2). In the best case, s = log_d n,
so C(n) = O(n log n). In other words, radix sort performs well only when the number s of digits in the representation of the Ai's is
small.
Another drawback of radix sort is that one may need d*n memory locations. This comes from the fact that all the items may be “sent
to the same pocket” during a given pass. This drawback may be minimized by using linked lists rather than arrays to store the items
during a given pass. However, one will still require 2*n memory locations.

SEARCHING AND DATA MODIFICATION


1. Suppose S is a collection of data maintained in memory by a table using some
type of data structure.
2. Searching is the operation which finds the location LOC in memory of some given
ITEM of information or sends some message that ITEM does not belong to S.
3. The search is said to be successful or unsuccessful according to whether ITEM
does or does not belong to S.
4. The searching algorithm that is used depends mainly on the type of data
structure that is used to maintain S in memory.
5. Data modification refers to the operations of inserting, deleting and updating.
6. Here data modification will mainly refer to inserting and deleting.
7. These operations are closely related to searching, since usually one must search
for the location of the ITEM to be deleted or one must search for the proper
place to insert ITEM in the table.
8. The insertion or deletion also requires a certain amount of execution time, which
also depends mainly on the type of data structure that is used.
9. Generally speaking, there is a tradeoff between data structures with fast
searching algorithms and data structures with fast modification algorithms.
10. This situation is illustrated below, where we summarize the searching and data
modification of three of the data structures previously studied in the text.
11. (1) Sorted array. Here one can use a binary search to find the location LOC
of a given ITEM in time O(log n).
12. On the other hand, inserting and deleting are very slow, since, on the average,
n/2 = O(n) elements must be moved for a given insertion or deletion.
13. Thus a sorted array would likely be used when there is a great deal of searching
but only very little data modification.
14. (2) Linked list. Here one can only perform a linear search to find the location
LOC of a given ITEM, and the search may be very slow, possibly requiring
time O(n).
15. On the other hand, inserting and deleting requires only a few pointers to be
changed.
16. Thus a linked list would be used when there is a great deal of data modification,
as in word (string) processing.
17. (3) Binary search tree. This data structure combines the advantages of the
sorted array and the linked list. That is, searching is reduced to searching only
a certain path P in the tree T, which, on the average, requires only O(log n)
comparisons.
18. Furthermore, the tree T is maintained in memory by a linked representation, so
only certain pointers need be changed after the location of the insertion or
deletion is found.
19. The main drawback of the binary search tree is that the tree may be very
unbalanced, so that the length of a path P may be O(n) rather than O(log n).
20. This will reduce the searching to approximately a linear search.
21. Remark: The above worst-case scenario of a binary search tree may be
eliminated by using a height-balanced binary search tree that is rebalanced after
each insertion or deletion.
22. The algorithms for such rebalancing are rather complicated and lie beyond the
scope of this text.
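
The tradeoff for the sorted array can be seen in a small Python sketch. It is only an illustration under our own assumptions: the names binary_search and sorted_insert are hypothetical, and the standard library function bisect_left stands in for a hand-written binary search.

```python
from bisect import bisect_left


def binary_search(sorted_items, item):
    """Return the index LOC of item in sorted_items, or None if item is absent."""
    loc = bisect_left(sorted_items, item)        # O(log n) comparisons
    if loc < len(sorted_items) and sorted_items[loc] == item:
        return loc
    return None


def sorted_insert(sorted_items, item):
    """Insert item in order and return how many elements had to be moved."""
    loc = bisect_left(sorted_items, item)        # finding the proper place is fast
    moved = len(sorted_items) - loc              # everything from loc onward shifts right
    sorted_items.insert(loc, item)               # the shift itself costs O(n)
    return moved


data = [11, 22, 33, 44, 55, 66, 77, 88]
print(binary_search(data, 55))    # successful search
print(binary_search(data, 50))    # unsuccessful search
print(sorted_insert(data, 30))    # number of elements moved by the insertion
```

The search touches only a few elements, while the insertion near the front of the array moves most of them.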

Searching Files, Searching Pointers
a) Suppose a file F of records R1, R2, . . . , RN is stored in memory.
b) Searching F usually refers to finding the location LOC in memory of the record with a
given key value relative to a primary key field K.
c) One way to simplify the searching is to use an auxiliary sorted array of pointers, as
discussed in Sec. 9.2.
d) Then a binary search can be used to quickly find the location LOC of the record with
the given key.
e) In the case where there is a great deal of inserting and deleting of records in the file,
one might want to use an auxiliary binary search tree rather than an auxiliary
sorted array.
f) In any case, the searching of the file F is reduced to the searching of a collection S of
items, as discussed above.
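
The idea of an auxiliary sorted array of pointers can be sketched as follows. This is a minimal Python illustration, not the text's procedure: the records, the names attached to them, and the function find_record are all hypothetical, and list indices play the role of pointers.

```python
from bisect import bisect_left

# Hypothetical file F: records kept in arrival order, keyed by employee number.
records = [(7148, "Ruiz"), (3205, "Chen"), (2345, "Okafor"), (9614, "Silva")]

# Auxiliary array of pointers (indices into `records`), sorted by key.
pointers = sorted(range(len(records)), key=lambda i: records[i][0])
sorted_keys = [records[i][0] for i in pointers]   # the keys in pointer order


def find_record(key):
    """Return the location LOC of the record with the given key, or None."""
    pos = bisect_left(sorted_keys, key)           # binary search on the sorted keys
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return pointers[pos]                      # follow the pointer into the file
    return None


print(pointers)
print(find_record(2345), find_record(1111))
```

The file itself is never rearranged; only the small pointer array is kept in sorted order.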

HASHING

1. The search time of each algorithm discussed so far depends on the number n of
elements in the collection S of data.
2. This section discusses a searching technique, called hashing or hash addressing,
which is essentially independent of the number n.
3. The terminology which we use in our presentation of hashing will be oriented
toward file management.
4. First of all, we assume that there is a file F of n records with a set K of keys
which uniquely determine the records in F.
5. Secondly, we assume that F is maintained in memory by a table T of m memory
locations and that L is the set of memory addresses of the locations in T.
6. For notational convenience, we assume that the keys in K and the addresses in L
are (decimal) integers.
7. (Analogous methods will work with binary integers or with keys which are
character strings, such as names, since there are standard ways of representing
strings by integers.)
8. The subject of hashing will be introduced by the following example.

EXAMPLE 9.9
1. Suppose a company with 68 employees assigns a 4-digit employee number to
each employee which is used as the primary key in the company’s employee file.
2. We can, in fact, use the employee number as the address of the record in
memory. The search will require no comparisons at all.
3. Unfortunately, this technique will require space for 10000 memory locations,
whereas only 68 such locations would actually be used.
4. Clearly, this tradeoff of space for time is not worth the expense.
5. The general idea of using the key to determine the address of a record is an
excellent idea, but it must be modified so that a great deal of space is not
wasted.
6. This modification takes the form of a function H from the set K of keys into the
set L of memory addresses. Such a function,
H: K → L
is called a hash function or hashing function.
7. Unfortunately, such a function H may not yield distinct values: it is possible
that two different keys k1 and k2 will yield the same hash address.
8. This situation is called collision, and some method must be used to resolve it.
9. Accordingly, the topic of hashing is divided into two parts: (1) hash functions
and (2) collision resolutions. We discuss these two parts separately.
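
To make the idea of a hash function and of a collision concrete, here is a small Python sketch. The table size 97 and the second key 3302 are assumptions chosen only for illustration; they are not taken from the employee file.

```python
def H(k, m=97):
    """A simple hash function: the remainder of k on division by m (m = 97 is assumed)."""
    return k % m


direct_address_slots = 10_000        # one slot for every possible 4-digit employee number
hash_table_slots = 97                # slots needed when addresses come from H
print(direct_address_slots, "slots versus", hash_table_slots)

# Two different keys may receive the same hash address: a collision.
k1, k2 = 3205, 3302                  # 3302 = 3205 + 97, an illustrative second key
print(H(k1), H(k2))
```

Any function that maps 10000 possible keys into fewer addresses must send some pairs of keys to the same address, so a collision procedure is always needed.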

Hash Functions
1. The two principal criteria used in selecting a hash function H: K→ L are as follows.
2. First of all, the function H should be very easy and quick to compute.
3. Second the function H should, as far as possible, uniformly distribute the hash
addresses throughout the set L so that there are a minimum number of
collisions.
4. Naturally, there is no guarantee that the second condition can be completely
fulfilled without actually knowing beforehand the keys and addresses.
5. However, certain general techniques do help.
6. One technique is to “chop” a key k into pieces and combine the pieces in some
way to form the hash address H(k). (The term “hashing” comes from this
technique of “chopping” a key into pieces.)
7. We next illustrate some popular hash functions. We emphasize that each of these
hash functions can be easily and quickly evaluated by the computer.
8. (a) Division method. Choose a number m larger than the number n of keys
in K. (The number m is usually chosen to be a prime number or a number
without small divisors, since this frequently minimizes the number of collisions.)
The hash function H is defined by
H(k)=k (mod m) or H(k)=k (mod m) + 1
Here k (mod m) denotes the remainder when k is divided by m. The second
formula is used when we want the hash addresses to range from 1 to m rather
than from 0 to m - 1.
(b) Mid square method. The key k is squared. Then the hash function H is
defined by H(k) = l
where l is obtained by deleting digits from both ends of k^2. We emphasize that
the same positions of k^2 must be used for all of the keys.
(c) Folding method. The key k is partitioned into a number of parts, k1,... kr,
where each part, except possibly the last, has the same number of digits as
the required address. Then the parts are added together, ignoring the last
carry. That is, H(k)=k1+k2+...+kr
where the leading-digit carries, if any, are ignored. Sometimes, for extra
“milling,” the even-numbered parts, k2, k4,. . . , are each reversed before the
addition.

EXAMPLE 9.10

Consider the company in Example 9.9, each of whose 68 employees is assigned a
unique 4-digit employee number. Suppose L consists of 100 two-digit addresses: 00, 01,
02, . . . , 99. We apply the above hash functions to each of the following employee
numbers: 3205, 7148, 2345
(a) Division method. Choose a prime number m close to 99, such as m = 97. Then
H(3205) = 4, H(7148) = 67, H(2345) = 17
That is, dividing 3205 by 97 gives a remainder of 4, dividing 7148 by 97 gives a
remainder of 67, and dividing 2345 by 97 gives a remainder of 17. In the case that
the memory addresses begin with 01 rather than 00, we use the function
H(k) = k (mod m) + 1 to obtain:
H(3205)=4+ 1=5, H(7148)=67+ 1=68, H(2345)=17+ 1=18
(b) Mid square method. The following calculations are performed:
k:     3205      7148      2345
k^2:   10272025  51093904  5499025
H(k):  72        93        99
Observe that the fourth and fifth digits, counting from the right, are chosen for
the hash address.
(c) Folding method. Chopping the key k into two parts and adding yields the following
hash addresses:
H(3205)=32+05=37, H(7148)=71+48=19, H(2345)=23+45=68
Observe that the leading digit 1 in H(7148) is ignored. Alternatively, one may want
to reverse the second part before adding, thus producing the following hash
addresses:
H(3205)=32+50=82, H(7148)=71+84=55, H(2345)=23+54=77
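
The three hash functions of this example can be checked with a short Python sketch. It assumes 4-digit keys and 2-digit addresses as in the example; the function names are our own, and the mid square function hard-codes the fourth and fifth digits from the right.

```python
def division_hash(k, m=97):
    """Division method: the remainder when k is divided by m."""
    return k % m


def midsquare_hash(k):
    """Mid square method: the fourth and fifth digits, from the right, of k squared."""
    return (k * k // 1000) % 100


def folding_hash(k, reverse=False):
    """Folding method: split a 4-digit key into two 2-digit parts and add them."""
    first, second = divmod(k, 100)
    if reverse:
        second = (second % 10) * 10 + second // 10   # reverse the digits of the second part
    return (first + second) % 100                    # the leading carry is ignored


for key in (3205, 7148, 2345):
    print(key, division_hash(key), midsquare_hash(key),
          folding_hash(key), folding_hash(key, reverse=True))
```

Note that the addresses are printed as integers, so the address 05 would appear as 5.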

Collision Resolution

1. Suppose we want to add a new record R with key k to our file F, but suppose the
memory location address H(k) is already occupied. This situation is called
collision.
2. This subsection discusses two general ways of resolving collisions. The particular
procedure that one chooses depends on many factors.
3. One important factor is the ratio of the number n of keys in K (which is the
number of records in F) to the number m of hash addresses in L. This ratio, λ = n/m,
is called the load factor.
4. First we show that collisions are almost impossible to avoid. Specifically, suppose
a student class has 24 students and suppose the table has space for 365
records.
5. One random hash function is to choose the student’s birthday as the hash
address. Although the load factor λ = 24/365 ≈ 7% is very small, it can be shown
that there is a better than fifty-fifty chance that two of the students have the
same birthday.
6. The efficiency of a hash function with a collision resolution procedure is
measured by the average number of probes (key comparisons) needed to find the
location of the record with a given key k.
7. The efficiency depends mainly on the load factor λ. Specifically, we are
interested in the following two quantities:
S(λ) = average number of probes for a successful search
U(λ) = average number of probes for an unsuccessful search
These quantities will be discussed for our collision procedures.
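
The birthday claim above can be verified numerically. The following Python sketch assumes that all 365 birthdays are equally likely and independent; the function name collision_probability is our own.

```python
def collision_probability(n, m=365):
    """Probability that at least two of n keys share one of m equally likely addresses."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (m - i) / m    # key i must avoid the i addresses already taken
    return 1.0 - p_distinct


print(round(collision_probability(24), 3))
```

For 24 students the result is a little above one half, which agrees with the statement that a collision is more likely than not.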

Open Addressing: Linear Probing and Modifications


1. Suppose that a new record R with key k is to be added to the memory table T,
but that the memory location with hash address H(k) = h is already filled.
2. One natural way to resolve the collision is to assign R to the first available
location following T[h]. (We assume that the table T with m locations is circular,
so that T[1] comes after T[m].)
3. Accordingly, with such a collision procedure, we will search for the record R in
the table T by linearly searching the locations T[h], T[h +1], T[h + 2], . . . until
finding R or meeting an empty location, which indicates an unsuccessful search.
4. The above collision resolution is called linear probing.
5. The average numbers of probes for a successful search and for an unsuccessful
search are known to be the following respective quantities:
S(λ) = (1/2)(1 + 1/(1 - λ)) and U(λ) = (1/2)(1 + 1/(1 - λ)^2)
(Here λ = n/m is the load factor.)
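
To see how quickly the cost grows with the load factor, one can tabulate the commonly quoted estimates for linear probing, S = (1/2)(1 + 1/(1 - λ)) and U = (1/2)(1 + 1/(1 - λ)^2). The Python sketch below simply evaluates these two expressions; the function name and the sample load factors are our own choices.

```python
def linear_probing_estimates(load):
    """Estimated average probes for linear probing at load factor `load` (0 <= load < 1)."""
    s = 0.5 * (1 + 1 / (1 - load))          # successful search
    u = 0.5 * (1 + 1 / (1 - load) ** 2)     # unsuccessful search
    return s, u


for load in (0.25, 0.5, 0.75, 0.9):
    s, u = linear_probing_estimates(load)
    print(f"load {load:.2f}: S = {s:.2f}, U = {u:.2f}")
```

Both estimates stay small while the table is at most half full and rise sharply as the load factor approaches 1.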

EXAMPLE 9.11

1. Suppose the table T has 11 memory locations, T[1], T[2], . . . , T[11], and suppose
the file F consists of 8 records, A, B, C, D, E, X, Y and Z, with the following hash
addresses:
2. Record: A, B, C, D, E, X, Y, Z
3. H(k): 4, 8, 2, 11, 4, 11, 5, 1
4. Suppose the 8 records are entered into the table T in the above order. Then the
file F will appear in memory as follows:
5. Table T: X, C, Z, A, E, Y,__, B,__,__, D
6. Address: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
7. Although Y is the only record with hash address H(k) = 5, the record is not
assigned to T[5], since T[5] has already been filled by E because of a previous
collision at T[4]. Similarly, Z does not appear in T[1].
8. The average number S of probes for a successful search follows:
9. S = (1 + 1 + 1 + 1 + 2 + 2 + 2 + 3)/8 = 13/8 ≈ 1.6

10. The average number U of probes for an unsuccessful search follows:

U = (7 + 6 + 5 + 4 + 3 + 2 + 1 + 2 + 1 + 1 + 8)/11 = 40/11 ≈ 3.6
11. The first sum adds the number of probes to find each of the 8 records, and the
second sum adds the number of probes to find an empty location for each of the 11
locations.
12. One main disadvantage of linear probing is that records tend to cluster, that is,
appear next to one another, when the load factor is greater than 50 percent.
13. Such a clustering substantially increases the average search time for a record. Two
techniques to minimize clustering are as follows:
(1) Quadratic probing. Suppose a record R with key k has the hash address H(k) =
h. Then, instead of searching the locations with addresses h, h + 1, h + 2, . . . ,
we search the locations with addresses
h, h + 1, h + 4, h + 9, h + 16, . . . , h + i^2, . . .
If the number m of locations in the table T is a prime number, then the above
sequence will access half of the locations in T.
(2) Double hashing. Here a second hash function H’ is used for resolving a
collision, as follows. Suppose a record R with key k has the hash addresses H(k)
= h and H’(k) = h’ ≠ m. Then we search the locations with addresses
h, h + h’, h + 2h’, h + 3h’, . . .
If m is a prime number, then the above sequence will access all the locations in
the table T.
14. Remark: One major disadvantage in any type of open addressing procedure is in
the implementation of deletion. Specifically, suppose a record R is deleted from
the location T[r].
15. Afterwards, suppose we meet T[r] while searching for another record R’. This
does not necessarily mean that the search is unsuccessful.
16. Thus, when deleting the record R, we must label the location T[r] to indicate that
it previously did contain a record.
17. Accordingly, open addressing is seldom used when a file F is constantly
changing.
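
The table and the two averages of Example 9.11 can be reproduced with a short simulation of linear probing. This is a minimal Python sketch under our own assumptions: the function names are hypothetical, index 0 of the list is left unused so that addresses run from 1 to m, and the table is assumed never to be completely full.

```python
def build_table(hashed_records, m):
    """Insert (name, h) pairs into a circular table T[1..m] using linear probing."""
    table = [None] * (m + 1)                 # index 0 is unused
    for name, h in hashed_records:
        loc = h
        while table[loc] is not None:        # occupied: try the next location
            loc = loc % m + 1                # T[1] comes after T[m]
        table[loc] = name
    return table


def probes_to_find(table, name, h, m):
    """Count the probes a successful search for `name` needs, starting at T[h]."""
    loc, count = h, 1
    while table[loc] != name:
        loc = loc % m + 1
        count += 1
    return count


def probes_to_empty(table, h, m):
    """Count the probes an unsuccessful search starting at T[h] needs."""
    loc, count = h, 1
    while table[loc] is not None:
        loc = loc % m + 1
        count += 1
    return count


records = [("A", 4), ("B", 8), ("C", 2), ("D", 11),
           ("E", 4), ("X", 11), ("Y", 5), ("Z", 1)]
m = 11
T = build_table(records, m)
S = sum(probes_to_find(T, name, h, m) for name, h in records) / len(records)
U = sum(probes_to_empty(T, h, m) for h in range(1, m + 1)) / m
print(T[1:])
print(S, U)
```

The sketch gives S = 13/8 and U = 40/11, the values computed by hand in Example 9.11.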

Chaining
1. Chaining involves maintaining two tables in memory.
2. First of all, as before, there is a table T in memory which contains the records in
F, except that T now has an additional field LINK which is used so that all records
in T with the same hash address h may be linked together to form a linked list.
3. Second, there is a hash address table LIST which contains pointers to the linked
lists in T.
4. Suppose a new record R with key k is added to the file F. We place R in the first
available location in the table T and then add R to the linked list with pointer
LIST[H(k)].
5. If the linked lists of records are not sorted, then R is simply inserted at the
beginning of its linked list.
6. Searching for a record or deleting a record is nothing more than searching for a
node or deleting a node from a linked list, as discussed in Chap. 5.
7. The average number of probes, using chaining, for a successful search and for
an unsuccessful search are known to be the following approximate values:
S(λ) ≈ 1 + λ/2 and U(λ) ≈ e^(-λ) + λ
8. Here the load factor λ = n/m may be greater than 1, since the number m of hash
addresses in L (not the number of locations in T) may be less than the number n of
records in F.

EXAMPLE 9.12

1. Consider again the data in Example 9.11, where the 8 records have the following
hash addresses:
2. Record: A, B, C, D, E, X, Y, Z
3. H(k): 4, 8, 2, 11, 4, 11, 5, 1
4. Using chaining, the records will appear in memory as pictured in Fig. 9-7.
Observe that the location of a record R in table T is not related to its hash
address.
5. A record is simply put in the first node in the AVAIL list of table T. In fact, table T
need not have the same number of elements as the hash address table.
6. The main disadvantage to chaining is that one needs 3m memory cells for the
data. Specifically, there are m cells for the information field INFO, there are m
cells for the link field LINK, and there are m cells for the pointer array LIST.
7. Suppose each record requires only 1 word for its information field. Then it may
be more useful to use open addressing with a table with 3m locations, which has
the load factor λ ≤ 1/3, than to use chaining to resolve collisions.
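
Chaining with the data of Example 9.12 can also be sketched in a few lines of Python. The sketch is only an approximation of the memory layout in the text: ordinary Python lists stand in for the linked lists, the separate table T with its AVAIL list is not modeled, and the function name build_chains is our own.

```python
def build_chains(hashed_records, m):
    """Return LIST[1..m], where LIST[h] holds the chain of records with hash address h."""
    chains = [[] for _ in range(m + 1)]      # index 0 is unused
    for name, h in hashed_records:
        chains[h].insert(0, name)            # unsorted chain: insert at the beginning
    return chains


records = [("A", 4), ("B", 8), ("C", 2), ("D", 11),
           ("E", 4), ("X", 11), ("Y", 5), ("Z", 1)]
m = 11
LIST = build_chains(records, m)

# A successful search walks the chain from its first node up to the record.
S = sum(LIST[h].index(name) + 1 for name, h in records) / len(records)
print(LIST[4], LIST[11])
print(S)
```

Only the two addresses 4 and 11 have chains with more than one record, so most successful searches need a single probe.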

Supplementary Problems

SORTING

9.1 Write a subprogram RANDOM(DATA, N, K) which assigns N random integers between 1 and K to the array DATA.

9.2 Translate insertion sort into a subprogram INSERTSORT(A, N) which sorts the array A with N elements. Test the program using:

(a) 44, 33, 11, 55, 77, 90, 40, 60, 99, 22, 88, 66
(b) D, A, T, A, S, T, R, U, C, T, U, R, E, S

9.3 Translate insertion sort into a subprogram INSERTCOUNT(A, N, NUMB) which sorts the array A with N elements and which also counts the number NUMB of comparisons.

9.4 Write a program TESTINSERT(N, AVE) which repeats 500 times the procedure INSERTCOUNT(A, N, NUMB) and which finds the average AVE of the 500 values of NUMB. (Theoretically, AVE ≈ N^2/4.) Use RANDOM(A, N, 5*N) from Prob. 9.1 as each input. Test the program using N = 100 (so, theoretically, AVE ≈ N^2/4 = 2500).

9.5 Translate quick sort into a subprogram QUICKCOUNT(A, N, NUMB) which sorts the array A with N elements and which also counts the number NUMB of comparisons. (See Sec. 6.5.)

9.6 Write a program TESTQUICKSORT(N, AVE) which repeats QUICKCOUNT(A, N, NUMB) 500 times and which finds the average AVE of the 500 values of NUMB. (Theoretically, AVE ≈ N log2 N.) Use RANDOM(A, N, 5*N) from Prob. 9.1 as each input. Test the program using N = 100 (so, theoretically, AVE ≈ 700).

9.7 Translate Procedure 9.2 into a subprogram MIN(A, LB, UB, LOC) which finds the location LOC of the smallest element among A[LB], A[LB + 1], . . . , A[UB].

9.8 Translate selection sort into a subprogram SELECTSORT(A, N) which sorts the array with N elements. Test the program using:

(a) 44, 33, 11, 55, 77, 90, 40, 60, 99, 22, 88, 66

(b) D, A, T, A, S, T, R, U, C, T, U, R, E, S

SEARCHING, HASHING
9.9 Suppose an unsorted linked list is in memory. Write a procedure

SEARCH(INFO, LINK, START, ITEM, LOC)

which (a) finds the location LOC of ITEM in the list or sets LOC := NULL for an unsuccessful search and (b) when the search is successful, interchanges ITEM with the element in front of it. (Such a list is said to be self-organizing. It has the property that elements which are frequently accessed tend to move to the beginning of the list.)

9.10 Consider the following 4-digit employee numbers (see Example 9.10):

9614, 5882, 6713, 4409, 1825

Find the 2-digit hash address of each number using (a) the division method, with m = 97; (b) the mid square method; (c) the folding method without reversing; and (d) the folding method with reversing.

9.11 Consider the data in Example 9.11. Suppose the 8 records are entered into the table T in the reverse order Z, Y, X, E, D, C, B, A. (a) Show how the file F appears in memory. (b) Find the average number S of probes for a successful search and the average number U of probes for an unsuccessful search. (Compare with the corresponding results in Example 9.11.)

9.12 Consider the data in Example 9.12 and Fig. 9-7. Suppose the following additional records are added to the file:

(P, 2), (Q, 7), (R, 4), (S, 9)

(Here the left entry is the record and the right entry is the hash address.) (a) Find the updated tables T and LIST. (b) Find the average number S of probes for a successful search and the average number U of probes for an unsuccessful search.

9.13 Write a subprogram MID(KEY, HASH) which uses the mid square method to find the 2-digit hash address HASH of a 4-digit employee number key.

9.14 Write a subprogram FOLD( KEY, HASH) which uses the folding method with reversing to find the 2-digit hash address HASH of a 4-digit employee number key.

End of UNIT 8
