
12 Variations on Backpropagation

Objectives
Theory and Examples
Drawbacks of Backpropagation
  Performance Surface Example
  Convergence Example
Heuristic Modifications to Backpropagation
  Momentum
  Variable Learning Rate
Numerical Optimization Techniques
  Conjugate Gradient
  Levenberg-Marquardt Algorithm
Summary of Results
Solved Problems
Epilogue
Further Reading
Exercises

Objectives

The backpropagation algorithm introduced in Chapter 11 was a major breakthrough in neural network research. However, the basic algorithm is too slow for most practical applications. In this chapter we present several variations of backpropagation that provide significant speedup and make the algorithm more practical.

We will begin by using a function approximation example to illustrate why the backpropagation algorithm is slow in converging. Then we will present several modifications to the algorithm. Recall that backpropagation is an approximate steepest descent algorithm. In Chapter 9 we saw that steepest descent is the simplest, and often the slowest, minimization method. The conjugate gradient algorithm and Newton's method generally provide faster convergence. In this chapter we will explain how these faster procedures can be used to speed up the convergence of backpropagation.


Theory and Examples


When the basic backpropagation algorithm is applied to a practical problem the training may take days or weeks of computer time. This has encouraged considerable research on methods to accelerate the convergence of the algorithm.

The research on faster algorithms falls roughly into two categories. The first category involves the development of heuristic techniques, which arise out of a study of the distinctive performance of the standard backpropagation algorithm. These heuristic techniques include such ideas as varying the learning rate, using momentum and rescaling variables (e.g., [VoMa88], [Jaco88], [Toll90] and [RiIr90]). In this chapter we will discuss the use of momentum and variable learning rates.

Another category of research has focused on standard numerical optimization techniques (e.g., [Shan90], [Barn92], [Batt92] and [Char92]). As we have discussed in Chapters 10 and 11, training feedforward neural networks to minimize squared error is simply a numerical optimization problem. Because numerical optimization has been an important research subject for 30 or 40 years (see Chapter 9), it seems reasonable to look for fast training algorithms in the large number of existing numerical optimization techniques. There is no need to "reinvent the wheel" unless absolutely necessary. In this chapter we will present two existing numerical optimization techniques that have been very successfully applied to the training of multilayer perceptrons: the conjugate gradient algorithm and the Levenberg-Marquardt algorithm (a variation of Newton's method).

We should emphasize that all of the algorithms that we will describe in this chapter use the backpropagation procedure, in which derivatives are processed from the last layer of the network to the first. For this reason they could all be called "backpropagation" algorithms. The differences between the algorithms occur in the way in which the resulting derivatives are used to update the weights. In some ways it is unfortunate that the algorithm we usually refer to as backpropagation is in fact a steepest descent algorithm. In order to clarify our discussion, for the remainder of this chapter we will refer to the basic backpropagation algorithm as steepest descent backpropagation (SDBP).

In the next section we will use a simple example to explain why SDBP has problems with convergence. Then, in the following sections, we will present various procedures to improve the convergence of the algorithm.

Drawbacks of Backpropagation


Recall from Chapter 10 that the LMS algorithm is guaranteed to converge to a solution that minimizes the mean squared error, so long as the learning rate is not too large. This is true because the mean squared error for a single-layer linear network is a quadratic function. The quadratic function has only a single stationary point. In addition, the Hessian matrix of a quadratic function is constant, therefore the curvature of the function in a given direction does not change, and the function contours are elliptical. SDBP is a generalization of the LMS algorithm. Like LMS, it is also an approximate steepest descent algorithm for minimizing the mean squared error. In fact, SDBP is equivalent to the LMS algorithm when used on a single-layer linear


network. (See Problem P11.10.) When applied to multi-layer networks, however, the characteristics of SDBP are quite different. This has to do with the differences between the mean squared error performance surfaces of single-layer linear networks and multilayer nonlinear networks. While the performance surface for a single-layer linear network has a single minimum point and constant curvature, the performance surface for a multilayer network may have many local minimum points, and the curvature can vary widely in different regions of the parameter space. This will become clear in the example that follows.

Performance Surface Example


To investigate the mean squared error performance surface for multilayer networks we will employ a simple function approximation example. We will use the 1-2-1 network shown in Figure 12.1, with log-sigmoid transfer functions in both layers:

a¹ = logsig(W¹p + b¹),

a² = logsig(W²a¹ + b²).

Figure 12.1 1-2-1 Function Approximation Network

In order to simplify our analysis, we will give the network a problem for which we know the optimal solution. The function we will approximate is the response of the same 1-2-1 network, with the following values for the weights and biases:

w¹₁,₁ = 10, w¹₂,₁ = 10, b¹₁ = -5, b¹₂ = 5,  (12.1)

w²₁,₁ = 1, w²₁,₂ = 1, b² = -1.  (12.2)

The network response for these parameters is shown in Figure 12.2, which plots the network output a² as the input p is varied over the range [-2, 2].

Figure 12.2 Nominal Function

We want to train the network of Figure 12.1 to approximate the function displayed in Figure 12.2. The approximation will be exact when the network parameters are set to the values given in Eq. (12.1) and Eq. (12.2). This is, of course, a very contrived problem, but it is simple and it illustrates some important concepts.
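The nominal response just described can be reproduced numerically. Below is a minimal sketch in plain Python (the helper names are ours, not the book's; the parameter values are those of Eq. (12.1) and Eq. (12.2)):

```python
import math

def logsig(n):
    """Log-sigmoid transfer function: 1 / (1 + e^-n)."""
    return 1.0 / (1.0 + math.exp(-n))

def net_1_2_1(p, w1=(10.0, 10.0), b1=(-5.0, 5.0), w2=(1.0, 1.0), b2=-1.0):
    """Response of the 1-2-1 network of Figure 12.1 with the nominal
    parameters of Eq. (12.1) and Eq. (12.2)."""
    # First layer: two log-sigmoid neurons
    a1 = [logsig(w1[i] * p + b1[i]) for i in range(2)]
    # Second layer: one log-sigmoid neuron
    return logsig(w2[0] * a1[0] + w2[1] * a1[1] + b2)
```

Sampling p over [-2, 2] traces the curve of Figure 12.2.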
Let's now consider the performance index for our problem. We will assume that the function is sampled at the values

p = -2, -1.9, -1.8, ..., 1.9, 2,  (12.3)

and that each occurs with equal probability. The performance index will be the sum of the squared errors at these 41 points. (We won't bother to find the mean squared error, which just requires dividing by 41.)

In order to be able to graph the performance index, we will vary only two parameters at a time. Figure 12.3 illustrates the squared error when only w¹₁,₁ and w²₁,₁ are being adjusted, while the other parameters are set to their optimal values given in Eq. (12.1) and Eq. (12.2). Note that the minimum error will be zero, and it will occur when w¹₁,₁ = 10 and w²₁,₁ = 1, as indicated by the open blue circle in the figure.
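This performance index can be computed the same way. A sketch, restricted (as in Figure 12.3) to the two free parameters w¹₁,₁ and w²₁,₁ with the rest held at their optimal values; the function names are ours:

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

def net(p, w1_11, w2_11):
    """1-2-1 network with only w1_11 and w2_11 free; the remaining
    parameters are fixed at their optimal values from Eq. (12.1)-(12.2)."""
    a1_1 = logsig(w1_11 * p - 5.0)
    a1_2 = logsig(10.0 * p + 5.0)
    return logsig(w2_11 * a1_1 + 1.0 * a1_2 - 1.0)

def sum_squared_error(w1_11, w2_11):
    """Sum of squared errors over the 41 sample points of Eq. (12.3),
    with targets generated by the nominal network."""
    pts = [-2.0 + 0.1 * k for k in range(41)]
    return sum((net(p, 10.0, 1.0) - net(p, w1_11, w2_11)) ** 2 for p in pts)
```

At the optimal parameters the error is exactly zero, which is the open blue circle of Figure 12.3.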


There are several features to notice about this error surface. First, it is clearly not a quadratic function. The curvature varies drastically over the parameter space. For this reason it will be difficult to choose an appropriate learning rate for the steepest descent algorithm. In some regions the surface is very flat, which would allow a large learning rate, while in other regions the curvature is high, which would require a small learning rate. (Refer to discussions in Chapters 9 and 10 on the choice of learning rate for the steepest descent algorithm.) It should be noted that the flat regions of the performance surface should not be unexpected, given the sigmoid transfer functions used by the network. The sigmoid is very flat for large inputs.

A second feature of this error surface is the existence of more than one local minimum point. The global minimum point is located at w¹₁,₁ = 10 and w²₁,₁ = 1, along the valley that runs parallel to the w¹₁,₁ axis. However, there is also a local minimum, which is located in the valley that runs parallel to the w²₁,₁ axis. (This local minimum is actually off the graph, at w¹₁,₁ = 0.88, w²₁,₁ = 38.6.) In the next section we will investigate the performance of backpropagation on this surface.

Figure 12.3 Squared Error Surface Versus w¹₁,₁ and w²₁,₁

Figure 12.4 illustrates the squared error when w¹₁,₁ and b¹₁ are being adjusted, while the other parameters are set to their optimal values. Note that the minimum error will be zero, and it will occur when w¹₁,₁ = 10 and b¹₁ = -5, as indicated by the open blue circle in the figure. Again we find that the surface has a very contorted shape, steep in some regions and very flat in others. Surely the standard steepest descent algorithm will have some trouble with this surface. For example, if we have an initial guess of w¹₁,₁ = 0, b¹₁ = 10, the gradient will be very close to zero, and the steepest descent algorithm would effectively stop, even though it is not close to a local minimum point.

Figure 12.4 Squared Error Surface Versus w¹₁,₁ and b¹₁

Figure 12.5 illustrates the squared error when b¹₁ and b¹₂ are being adjusted, while the other parameters are set to their optimal values. The minimum error is located at b¹₁ = -5 and b¹₂ = 5, as indicated by the open blue circle in the figure. This surface illustrates an important property of multilayer networks: they have a symmetry to them. Here we see that there are two local minimum points and they both have the same value of squared error. The second solution corresponds to the same network being turned upside down (i.e., the top neuron in the first layer is exchanged with the bottom neuron). It is because of this characteristic of neural networks that we do not set the initial weights and biases to zero. The symmetry causes zero to be a saddle point of the performance surface.

This brief study of the performance surfaces for multilayer networks gives us some hints as to how to set the initial guess for the SDBP algorithm. First, we do not want to set the initial parameters to zero. This is because the origin of the parameter space tends to be a saddle point for the performance surface. Second, we do not want to set the initial parameters to large values. This is because the performance surface tends to have very flat regions as we move far away from the optimum point. Typically we choose the initial weights and biases to be small random values. In this way we stay away from a possible saddle point at the origin without moving out to the very flat regions of the performance surface. (Another procedure for choosing the initial parameters is described in [NgWi90].) As we will see in the next section, it is also useful to try several different initial guesses, in order to be sure that the algorithm converges to a global minimum point.



Figure 12.5 Squared Error Surface Versus b¹₁ and b¹₂

Convergence Example
Now that we have examined the performance surface, let's investigate the performance of SDBP. For this section we will use a variation of the standard algorithm, called batching, in which the parameters are updated only after the entire training set has been presented. The gradients calculated at each training example are averaged together to produce a more accurate estimate of the gradient. (If the training set is complete, i.e., covers all possible input/output pairs, then the gradient estimate will be exact.)

In Figure 12.6 we see two trajectories of SDBP (batch mode) when only two parameters, w¹₁,₁ and w²₁,₁, are adjusted. For the initial condition labeled "a" the algorithm does eventually converge to the optimal solution, but the convergence is slow. The reason for the slow convergence is the change in curvature of the surface over the path of the trajectory. After an initial

moderate slope, the trajectory passes over a very flat surface, until it falls into a very gently sloping valley. If we were to increase the learning rate, the algorithm would converge faster while passing over the initial flat surface, but would become unstable when falling into the valley, as we will see in a moment.

Trajectory "b" illustrates how the algorithm can converge to a local minimum point. The trajectory is trapped in a valley and diverges from the optimal solution. If allowed to continue, the trajectory converges to w¹₁,₁ = 0.88, w²₁,₁ = 38.6. The existence of multiple local minimum points is typical of the performance surface of multilayer networks. For this reason it is best to try several different initial guesses in order to ensure that a global minimum has been obtained. (Some of the local minimum points may have the same value of squared error, as we saw in Figure 12.5, so we would not expect the algorithm to converge to the same parameter values for each initial guess. We just want to be sure that the same minimum error is obtained.)
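The batching procedure described above, averaging the per-example gradients before a single steepest-descent update, can be sketched as follows (grad_fn stands in for the backpropagation computation of Chapter 11, and the flattened parameter list is our simplification):

```python
def batch_gradient(grad_fn, params, training_set):
    """Average the gradients computed at each training example to
    produce a single, more accurate gradient estimate."""
    total = None
    for p, t in training_set:
        g = grad_fn(params, p, t)          # per-example gradient (a list)
        total = list(g) if total is None else [a + b for a, b in zip(total, g)]
    return [a / len(training_set) for a in total]

def sdbp_batch_step(params, grad, alpha):
    """One steepest-descent update: params <- params - alpha * grad."""
    return [x - alpha * g for x, g in zip(params, grad)]
```

One such update per pass through the training set is what "batch mode" means throughout this chapter.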


Figure 12.6 Two SDBP (Batch Mode) Trajectories

The progress of the algorithm can also be seen in Figure 12.7, which shows the squared error versus the iteration number. The curve on the left corresponds to trajectory "a" and the curve on the right corresponds to trajectory "b." These curves are typical of SDBP, with long periods of little progress and then short periods of rapid advance.
Figure 12.7 Squared Error Convergence Patterns

We can see that the flat sections in Figure 12.7 correspond to times when the algorithm is traversing a flat section of the performance surface, as shown in Figure 12.6. During these periods we would like to increase the learning rate, in order to speed up convergence. However, if we increase the learning rate the algorithm will become unstable when it reaches steeper portions of the performance surface.

This effect is illustrated in Figure 12.8. The trajectory shown here corresponds to trajectory "a" in Figure 12.6, except that a larger learning rate was used. The algorithm converges faster at first, but when the trajectory reaches the narrow valley that contains the minimum point the algorithm begins to diverge. This suggests that it would be useful to vary the learning rate. We could increase the learning rate on flat surfaces and then decrease the learning rate as the slope increased. The question is: How will the algorithm know when it is on a flat surface? We will discuss this in a later section.

Figure 12.8 Trajectory with Learning Rate Too Large

Another way to improve convergence would be to smooth out the trajectory. Note in Figure 12.8 that when the algorithm begins to diverge it is oscillating back and forth across a narrow valley. If we could filter the trajectory, by averaging the updates to the parameters, this might smooth out the oscillations and produce a stable trajectory. We will discuss this procedure in the next section.

To experiment with this backpropagation example, use the Neural Network Design Demonstration Steepest Descent Backpropagation (nnd12sd).


Heuristic Modifications of Backpropagation


Now that we have investigated some of the drawbacks of backpropagation (steepest descent), let's consider some procedures for improving the algorithm. In this section we will discuss two heuristic methods. In a later section we will present two methods based on standard numerical optimization algorithms.

Momentum
The first method we will discuss is the use of momentum. This is a modification based on our observation in the last section that convergence might be improved if we could smooth out the oscillations in the trajectory. We can do this with a low-pass filter. Before we apply momentum to a neural network application, let's investigate a simple example to illustrate the smoothing effect. Consider the following first-order filter:

y(k) = γ y(k-1) + (1 - γ) w(k),  (12.4)

where w(k) is the input to the filter, y(k) is the output of the filter and γ is the momentum coefficient, which must satisfy

0 ≤ γ < 1.  (12.5)
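Eq. (12.4) can be simulated directly, which makes the low-pass behavior easy to check. A minimal sketch in plain Python (the zero initial filter state and the particular sine-wave input are our illustrative choices):

```python
import math

def momentum_filter(w, gamma):
    """First-order low-pass filter of Eq. (12.4):
    y(k) = gamma * y(k-1) + (1 - gamma) * w(k), starting from y = 0."""
    y, out = 0.0, []
    for wk in w:
        y = gamma * y + (1.0 - gamma) * wk
        out.append(y)
    return out

# An oscillating input with average value 1 (a sine wave, as in the
# example that follows in the text)
w = [1.0 + math.sin(2.0 * math.pi * k / 16.0) for k in range(200)]
y = momentum_filter(w, gamma=0.9)
```

The filtered signal oscillates much less than the input while settling to the same average value; pushing gamma toward 1 damps the oscillation further but makes the output slower to respond.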

The effect of this filter is shown in Figure 12.9. For these examples the input to the filter was taken to be the sine wave

w(k) = 1 + sin(2πk/16),  (12.6)

and the momentum coefficient was set to γ = 0.9 (left graph) and γ = 0.98 (right graph). Here we can see that the oscillation of the filter output is less than the oscillation in the filter input (as we would expect for a low-pass filter). In addition, as γ is increased the oscillation in the filter output is reduced. Notice also that the average filter output is the same as the average filter input, although as γ is increased the filter output is slower to respond.

Figure 12.9 Smoothing Effect of Momentum: a) γ = 0.9, b) γ = 0.98

To summarize, the filter tends to reduce the amount of oscillation, while still tracking the average value. Now let's see how this works on the neural network problem. First, recall that the parameter updates for SDBP (Eq. (11.46) and Eq. (11.47)) are
ΔWᵐ(k) = -α sᵐ (aᵐ⁻¹)ᵀ,  (12.7)

Δbᵐ(k) = -α sᵐ.  (12.8)


When the momentum filter is added to the parameter changes, we obtain the following equations for the momentum modification to backpropagation (MOBP):

ΔWᵐ(k) = γ ΔWᵐ(k-1) - (1 - γ) α sᵐ (aᵐ⁻¹)ᵀ,  (12.9)

Δbᵐ(k) = γ Δbᵐ(k-1) - (1 - γ) α sᵐ.  (12.10)
If we now apply these modified equations to the example in the preceding section, we obtain the results shown in Figure 12.10. (For this example we have used a batching form of MOBP, in which the parameters are updated only after the entire training set has been presented. The gradients calculated at each training example are averaged together to produce a more accurate estimate of the gradient.) This trajectory corresponds to the same initial condition and learning rate shown in Figure 12.8, but with a momentum coefficient of γ = 0.8. We can see that the algorithm is now stable. By the use of momentum we have been able to use a larger learning rate, while maintaining the stability of the algorithm. Another feature of momentum is that it tends to accelerate convergence when the trajectory is moving in a consistent direction.

Figure 12.10 Trajectory with Momentum

If you look carefully at the trajectory in Figure 12.10, you can see why the procedure is given the name momentum. It tends to make the trajectory continue in the same direction. The larger the value of γ, the more "momentum" the trajectory has.
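The MOBP change of Eq. (12.9) and Eq. (12.10) can be sketched over a flattened parameter vector (the flattening is our simplification; grad stands for the SDBP terms sᵐ(aᵐ⁻¹)ᵀ or sᵐ):

```python
def mobp_update(delta_prev, grad, alpha, gamma):
    """Momentum backpropagation (MOBP) parameter change:
    delta(k) = gamma * delta(k-1) - (1 - gamma) * alpha * grad."""
    return [gamma * d - (1.0 - gamma) * alpha * g
            for d, g in zip(delta_prev, grad)]
```

With gamma = 0 this reduces to the SDBP step of Eq. (12.7) and Eq. (12.8); with gamma near 1 the previous change dominates, which is what smooths the trajectory and builds up speed in a consistent direction.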
To experiment with momentum, use the Neural Network Design Demonstration Momentum Backpropagation (nnd12mo).


Variable Learning Rate

We suggested earlier in this chapter that we might be able to speed up convergence if we increase the learning rate on flat surfaces and then decrease the learning rate when the slope increases. In this section we want to explore this concept.

Recall that the mean squared error performance surface for single-layer linear networks is always a quadratic function, and the Hessian matrix is therefore constant. The maximum stable learning rate for the steepest descent algorithm is two divided by the maximum eigenvalue of the Hessian matrix. (See Eq. (9.25).)

As we have seen, the error surface for the multilayer network is not a quadratic function. The shape of the surface can be very different in different regions of the parameter space. Perhaps we can speed up convergence by adjusting the learning rate during the course of training. The trick will be to determine when to change the learning rate and by how much.

There are many different approaches for varying the learning rate. We will describe a very straightforward batching procedure [VoMa88], where the learning rate is varied according to the performance of the algorithm. The rules of the variable learning rate backpropagation algorithm (VLBP) are:

1. If the squared error (over the entire training set) increases by more than some set percentage ζ (typically one to five percent) after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor 0 < ρ < 1, and the momentum coefficient γ (if it is used) is set to zero.

2. If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor η > 1. If γ has been previously set to zero, it is reset to its original value.

3. If the squared error increases by less than ζ, then the weight update is accepted but the learning rate is unchanged. If γ has been previously set to zero, it is reset to its original value.

(See Problem P12.3 for a numerical example of VLBP.)

To illustrate VLBP, let's apply it to the function approximation problem of the previous section. Figure 12.11 displays the trajectory for the algorithm using the same initial guess, initial learning rate and momentum coefficient as was used in Figure 12.10. The new parameters were assigned the

values η = 1.05, ρ = 0.7 and ζ = 4%.  (12.11)


Figure 12.11 Variable Learning Rate Trajectory

Notice how the learning rate, and therefore the step size, tends to increase when the trajectory is traveling in a straight line with constantly decreasing error. This effect can also be seen in Figure 12.12, which shows the squared error and the learning rate versus iteration number.

When the trajectory reaches a narrow valley, the learning rate is rapidly decreased. Otherwise the trajectory would have become oscillatory, and the error would have increased dramatically. For each potential step where the error would have increased by more than 4% the learning rate is reduced and the momentum is eliminated, which allows the trajectory to make the quick turn to follow the valley toward the minimum point. The learning rate then increases again, which accelerates the convergence. The learning rate is reduced again when the trajectory overshoots the minimum point when the algorithm has almost converged. This process is typical of a VLBP trajectory.

Figure 12.12 Convergence Characteristics of Variable Learning Rate

There are many variations on this variable learning rate algorithm. Jacobs [Jaco88] proposed the delta-bar-delta learning rule, in which each network parameter (weight or bias) has its own learning rate. The algorithm increases


the learning rate for a network parameter if the parameter change has been in the same direction for several iterations. If the direction of the parameter change alternates, then the learning rate is reduced.

The SuperSAB algorithm of Tollenaere [Toll90] is similar to the delta-bar-delta rule, but it has more complex rules for adjusting the learning rates. Another heuristic modification to SDBP is the Quickprop algorithm of Fahlman [Fahl88]. It assumes that the error surface is parabolic and concave upward around the minimum point and that the effect of each weight can be considered independently. (References to other SDBP modifications are given in Chapter 19.)

The heuristic modifications to SDBP can often provide much faster convergence for some problems. However, there are two main drawbacks to these methods. The first is that the modifications require that several parameters be set (e.g., ζ, ρ and η), while the only parameter required for SDBP is the learning rate. Some of the more complex heuristic modifications can have five or six parameters to be selected. Often the performance of the algorithm is sensitive to changes in these parameters. The choice of parameters is also problem dependent. The second drawback to these modifications to SDBP is that they can sometimes fail to converge on problems for which SDBP will eventually find a solution. Both of these drawbacks tend to occur more often when using the more complex algorithms.

To experiment with VLBP, use the Neural Network Design Demonstration Variable Learning Rate Backpropagation (nnd12vl).
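The three VLBP rules can be collected into a single accept/reject step. A sketch (a simplification: the momentum coefficient is tracked but not applied to the trial step, and error_fn stands for the squared error over the whole training set; η = 1.05, ρ = 0.7, ζ = 4% follow Eq. (12.11)):

```python
def vlbp_step(params, grad, state, error_fn,
              eta=1.05, rho=0.7, zeta=0.04, gamma0=0.8):
    """One VLBP update. state holds the learning rate 'alpha', the
    momentum coefficient 'gamma' and the last accepted error 'err'."""
    trial = [x - state["alpha"] * g for x, g in zip(params, grad)]
    new_err = error_fn(trial)
    if new_err > state["err"] * (1.0 + zeta):
        # Rule 1: error grew by more than zeta -- discard the step,
        # shrink the learning rate, and zero the momentum coefficient.
        state["alpha"] *= rho
        state["gamma"] = 0.0
        return params
    if new_err < state["err"]:
        # Rule 2: error decreased -- accept the step and grow the rate.
        state["alpha"] *= eta
    # Rule 3 (error grew by less than zeta): accept, rate unchanged.
    # In rules 2 and 3 the momentum coefficient is restored.
    state["gamma"] = gamma0
    state["err"] = new_err
    return trial
```

On a simple quadratic this reproduces the qualitative behavior of Figure 12.12: the rate grows while the error keeps dropping and collapses when a step would overshoot.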

Numerical Optimization Techniques


Now that we have investigated some of the heuristic modifications to SDBP, let's consider those methods that are based on standard numerical optimization techniques. We will investigate two techniques: conjugate gradient and Levenberg-Marquardt. The conjugate gradient algorithm for quadratic functions was presented in Chapter 9. We need to add two procedures to this algorithm in order to apply it to more general functions. The second numerical optimization method we will discuss in this chapter is the Levenberg-Marquardt algorithm, which is a modification to Newton's method that is well-suited to neural network training.

Conjugate Gradient

In Chapter 9 we presented three numerical optimization techniques: steepest descent, conjugate gradient and Newton's method. Steepest descent is the simplest algorithm, but is often slow in converging. Newton's method is much faster, but requires that the Hessian matrix and its inverse be calculated. The conjugate gradient algorithm is something of a compromise; it does not require the calculation of second derivatives, and yet it still has the quadratic convergence property. (It converges to the minimum of a quadratic function in a finite number of iterations.) In this section we will describe how the conjugate gradient algorithm can be used to train multilayer networks.

The steps of the conjugate gradient algorithm are:

1. Select the first search direction to be the negative of the gradient:

p₀ = -g₀,  (12.12)

where

g_k ≡ ∇F(x)|_{x = x_k}.  (12.13)

2. Take a step according to Eq. (9.57), selecting the learning rate α_k to minimize the function along the search direction:

x_{k+1} = x_k + α_k p_k.  (12.14)

3. Select the next search direction according to Eq. (9.60), using Eq. (9.61), Eq. (9.62), or Eq. (9.63) to calculate β_k:

p_k = -g_k + β_k p_{k-1},  (12.15)

with

β_k = (Δg_{k-1}ᵀ g_k)/(Δg_{k-1}ᵀ p_{k-1})  or  β_k = (g_kᵀ g_k)/(g_{k-1}ᵀ g_{k-1})  or  β_k = (Δg_{k-1}ᵀ g_k)/(g_{k-1}ᵀ g_{k-1}).  (12.16)

4. If the algorithm has not converged, continue from step 2.

This conjugate gradient algorithm cannot be applied directly to the neural network training task, because the performance index is not quadratic. This affects the algorithm in two ways. First, we will not be able to use Eq. (9.31) to minimize the function along a line, as required in step 2. Second, the exact minimum will not normally be reached in a finite number of steps, and therefore the algorithm will need to be reset after some set number of iterations.

Let's address the linear search first. We need to have a general procedure for locating the minimum of a function in a specified direction. This will involve two steps: interval location and interval reduction. The purpose of the interval location step is to find some initial interval that contains a local minimum. The interval reduction step then reduces the size of the initial interval until the minimum is located to the desired accuracy.
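Before turning to the line search, step 3 above is easy to illustrate in isolation. A sketch using the second of the three β forms in Eq. (12.16) (the Fletcher-Reeves form; the plain-Python vector helpers are ours):

```python
def fletcher_reeves_beta(g_new, g_old):
    """beta_k = (g_k^T g_k) / (g_{k-1}^T g_{k-1}), the second of the
    three beta choices listed for step 3."""
    return sum(g * g for g in g_new) / sum(g * g for g in g_old)

def next_direction(g_new, p_old, beta):
    """p_k = -g_k + beta_k * p_{k-1} (step 3), with p_0 = -g_0."""
    return [-g + beta * p for g, p in zip(g_new, p_old)]
```

The three β formulas are equivalent for exact quadratics; they differ in how they behave on the non-quadratic surfaces of multilayer networks.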


We will use a function comparison method [Scal85] to perform the interval location step. This procedure is illustrated in Figure 12.13. We begin by evaluating the performance index at an initial point, represented by a₁ in the figure. This point corresponds to the current values of the network weights and biases. In other words, we are evaluating

F(x₀).  (12.17)

The next step is to evaluate the function at a second point, represented by b₁ in the figure, which is a distance ε from the initial point, along the first search direction p₀. In other words, we are evaluating

F(x₀ + ε p₀).  (12.18)

Figure 12.13 Interval Location

We then continue to evaluate the performance index at new points bᵢ, successively doubling the distance between points. This process stops when the function increases between two consecutive evaluations; in Figure 12.13 this is represented by b₃ to b₄. At this point we know that the minimum is bracketed by the two points a₅ and b₅. We cannot narrow the interval any further, because the minimum may occur either in the interval [a₄, b₄] or in the interval [a₃, b₃]. These two possibilities are illustrated in Figure 12.14 (a).

Now that we have located an interval containing the minimum, the next step in the linear search is interval reduction. This will involve evaluating the function at points inside the interval [a₅, b₅], which was selected in the interval location step. From Figure 12.14 we can see that we will need to evaluate the function at two internal points (at least) in order to reduce the size of the interval of uncertainty. Figure 12.14 (a) shows that one internal function evaluation does not provide us with any information on the location of the minimum. However, if we evaluate the function at two points c and d, as in Figure 12.14 (b), we can reduce the interval of uncertainty. If

F(c) > F(d), as shown in Figure 12.14 (b), then the minimum must occur in the interval [c, b]. Conversely, if F(c) < F(d), then the minimum must occur in the interval [a, d]. (Note that we are assuming that there is a single minimum located in the initial interval. More about that later.)

Figure 12.14 Reducing the Interval of Uncertainty: (a) interval is not reduced; (b) minimum must occur between c and b

The procedure described above reduces the interval of uncertainty. We now need to decide on the location of the internal points c and d (see [Scal85]). We will use a method called the Golden Section search, which is designed to reduce the number of function evaluations required; at each iteration only one new function evaluation is needed. For example, in the case illustrated in Figure 12.14 (b), point a would be discarded and point c would become the new a. Then point d would become the new c, and a new d would be placed between the original points d and b. The trick is to place the new point so that the interval of uncertainty will be reduced as quickly as possible.
Golden Section Search

τ = 0.618
Set  c₁ = a₁ + (1-τ)(b₁ - a₁), F_c = F(c₁)
     d₁ = b₁ - (1-τ)(b₁ - a₁), F_d = F(d₁)
For k = 1, 2, ... repeat
    If F_c < F_d then
        Set  a_{k+1} = a_k ; b_{k+1} = d_k ; d_{k+1} = c_k
             c_{k+1} = a_{k+1} + (1-τ)(b_{k+1} - a_{k+1})
             F_d = F_c ; F_c = F(c_{k+1})
    else
        Set  a_{k+1} = c_k ; b_{k+1} = b_k ; c_{k+1} = d_k
             d_{k+1} = b_{k+1} - (1-τ)(b_{k+1} - a_{k+1})
             F_c = F_d ; F_d = F(d_{k+1})
    end
end until b_{k+1} - a_{k+1} < tol
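The pseudocode above translates almost line for line into code. A sketch (here f plays the role of the performance index evaluated along the search direction, F(x_k + α p_k); the quadratic test function in the note below is ours):

```python
def golden_section(f, a, b, tol):
    """Golden Section interval reduction with tau = 0.618: shrink the
    bracket [a, b] until b - a < tol, reusing one of the two interior
    evaluations at every iteration."""
    tau = 0.618
    c = a + (1.0 - tau) * (b - a)
    d = b - (1.0 - tau) * (b - a)
    fc, fd = f(c), f(d)
    while b - a >= tol:
        if fc < fd:
            # Minimum is in [a, d]: old c becomes the new d
            b, d, fd = d, c, fc
            c = a + (1.0 - tau) * (b - a)
            fc = f(c)
        else:
            # Minimum is in [c, b]: old d becomes the new c
            a, c, fc = c, d, fd
            d = b - (1.0 - tau) * (b - a)
            fd = f(d)
    return (a + b) / 2.0
```

For example, golden_section(lambda x: (x - 1.5) ** 2, 0.0, 4.0, 1e-5) locates the minimum near 1.5. Because τ² ≈ 1 - τ, one interior point of the old bracket always lands on an interior point of the new one, which is why only one fresh evaluation is needed per iteration.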

Where tol is the accuracy tolerance set by the user. (See Problem P12.4 for a numerical example of the interval location and interval reduction procedures.) There is one more modification to the conjugate gradient algorithm that needs to be made before we apply it to neural network traniing. For quadratic functions the algorithm will converge to the minimum in at most n iterations, where a is the number of parameters being optimized. The mean squared error performance index for multilayer networks is not quadratic, therefore the algorithm would not normally converge in n iterations. The developmeiit of the conjugate gradLent algorithm does not indicate what search direction to use once a cycle of a iterations has been completed. There have been many procedures suggested, but the simplest method is to reset the search direction to the steepest descent direction (negative of the gradient) after a iterations lScal85J. We will use this metho d . Let's now apply the conjugate gradient algorithm to the function approxiniation example that we have been using to demonstrate the other neural network training algorithms. We will use the backpropagation algorithm to compute the gradient (using Eq. (11.23) and Eq. (11.24)) and the conjugate gradient algorithm to determine the weight updates. This is a batch mode algorithm, as the gradient is computed after the entire training set has been presented to the network. Figure 12.15 shows the intermediate steps of the CGBP algorithm for the first three iterations. The interval location process is illustrated by the open blue circles; each one represents one evaluation of the function. The final interval is indicated by the larger open black circles. The

black dots in Figure 12.15 indicate the location of the new interior points during the Golden Section search, one for each iteration of the procedure. The final point is indicated by a blue dot.

Figure 12.16 shows the total trajectory to convergence. Notice that the CGBP algorithm converges in many fewer iterations than the other algorithms that we have tested. This is a little deceiving, since each iteration of CGBP requires more computations than the other methods; there are many function evaluations involved in each iteration of CGBP. Even so, CGBP has been shown to be one of the fastest batch training algorithms for multilayer networks [Char92].

[Figure 12.15 panels: sum squared error versus iteration number, and contour plots of the search trajectory.]
Figure 12.15 Intermediate Steps of CGBP

Figure 12.16 Conjugate Gradient Trajectory


experiment with 'CGBF, use the Neural Network Design Demons rations Conjugate Gradient Line Search (nndl 21i) and Conjugate Cradient Backpropagation(nndl2c9).
To

Levenberg-Marquardt Algorithm

The Levenberg-Marquardt algorithm is a variation of Newton's method that was designed for minimizing functions that are sums of squares of other nonlinear functions. This is very well suited to neural network training, where the performance index is the mean squared error.

Basic Algorithm

Let's begin by considering the form of Newton's method where the performance index is a sum of squares. Recall from Chapter 9 that Newton's method for optimizing a performance index $F(\mathbf{x})$ is

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{A}_k^{-1}\mathbf{g}_k , \qquad (12.19)$$

where $\mathbf{A}_k \equiv \nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$ and $\mathbf{g}_k \equiv \nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$.

If we assume that $F(\mathbf{x})$ is a sum of squares function:

$$F(\mathbf{x}) = \sum_{i=1}^{N} v_i^2(\mathbf{x}) = \mathbf{v}^T(\mathbf{x})\,\mathbf{v}(\mathbf{x}) , \qquad (12.20)$$

then the $j$th element of the gradient would be

$$[\nabla F(\mathbf{x})]_j = \frac{\partial F(\mathbf{x})}{\partial x_j} = 2\sum_{i=1}^{N} v_i(\mathbf{x})\frac{\partial v_i(\mathbf{x})}{\partial x_j} . \qquad (12.21)$$

The gradient can therefore be written in matrix form:

$$\nabla F(\mathbf{x}) = 2\mathbf{J}^T(\mathbf{x})\,\mathbf{v}(\mathbf{x}) , \qquad (12.22)$$

where

$$\mathbf{J}(\mathbf{x}) = \begin{bmatrix}
\dfrac{\partial v_1(\mathbf{x})}{\partial x_1} & \dfrac{\partial v_1(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial v_1(\mathbf{x})}{\partial x_n} \\[1ex]
\dfrac{\partial v_2(\mathbf{x})}{\partial x_1} & \dfrac{\partial v_2(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial v_2(\mathbf{x})}{\partial x_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial v_N(\mathbf{x})}{\partial x_1} & \dfrac{\partial v_N(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial v_N(\mathbf{x})}{\partial x_n}
\end{bmatrix} \qquad (12.23)$$

is the Jacobian matrix.

Next we want to find the Hessian matrix. The $k,j$ element of the Hessian matrix would be

$$[\nabla^2 F(\mathbf{x})]_{k,j} = \frac{\partial^2 F(\mathbf{x})}{\partial x_k \partial x_j} = 2\sum_{i=1}^{N}\left\{ \frac{\partial v_i(\mathbf{x})}{\partial x_k}\frac{\partial v_i(\mathbf{x})}{\partial x_j} + v_i(\mathbf{x})\frac{\partial^2 v_i(\mathbf{x})}{\partial x_k \partial x_j} \right\} . \qquad (12.24)$$

The Hessian matrix can then be expressed in matrix form:

$$\nabla^2 F(\mathbf{x}) = 2\mathbf{J}^T(\mathbf{x})\,\mathbf{J}(\mathbf{x}) + 2\mathbf{S}(\mathbf{x}) , \qquad (12.25)$$

where

$$\mathbf{S}(\mathbf{x}) = \sum_{i=1}^{N} v_i(\mathbf{x})\,\nabla^2 v_i(\mathbf{x}) . \qquad (12.26)$$

If we assume that $\mathbf{S}(\mathbf{x})$ is small, we can approximate the Hessian matrix as

$$\nabla^2 F(\mathbf{x}) \cong 2\mathbf{J}^T(\mathbf{x})\,\mathbf{J}(\mathbf{x}) . \qquad (12.27)$$

If we then substitute Eq. (12.27) and Eq. (12.22) into Eq. (12.19), we obtain the Gauss-Newton method:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \left[2\mathbf{J}^T(\mathbf{x}_k)\mathbf{J}(\mathbf{x}_k)\right]^{-1} 2\mathbf{J}^T(\mathbf{x}_k)\mathbf{v}(\mathbf{x}_k) = \mathbf{x}_k - \left[\mathbf{J}^T(\mathbf{x}_k)\mathbf{J}(\mathbf{x}_k)\right]^{-1}\mathbf{J}^T(\mathbf{x}_k)\mathbf{v}(\mathbf{x}_k) . \qquad (12.28)$$

Note that the advantage of Gauss-Newton over the standard Newton's method is that it does not require calculation of second derivatives.

One problem with the Gauss-Newton method is that the matrix $\mathbf{H} = 2\mathbf{J}^T\mathbf{J}$ may not be invertible. This can be overcome by using the following modification to the approximate Hessian matrix:

$$\mathbf{G} = \mathbf{H} + \mu\mathbf{I} . \qquad (12.29)$$

To see how this matrix can be made invertible, suppose that the eigenvalues and eigenvectors of $\mathbf{H}$ are $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ and $\{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n\}$. Then

$$\mathbf{G}\mathbf{z}_i = [\mathbf{H} + \mu\mathbf{I}]\mathbf{z}_i = \mathbf{H}\mathbf{z}_i + \mu\mathbf{z}_i = \lambda_i\mathbf{z}_i + \mu\mathbf{z}_i = (\lambda_i + \mu)\mathbf{z}_i . \qquad (12.30)$$

Therefore the eigenvectors of $\mathbf{G}$ are the same as the eigenvectors of $\mathbf{H}$, and the eigenvalues of $\mathbf{G}$ are $(\lambda_i + \mu)$. $\mathbf{G}$ can be made positive definite by increasing $\mu$ until $(\lambda_i + \mu) > 0$ for all $i$, and therefore the matrix will be invertible.
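A quick numerical check of this eigenvalue-shift argument (the symmetric matrix H below is an arbitrary example of our own, not from the text):

```python
import numpy as np

# Adding mu*I to a symmetric matrix H shifts every eigenvalue of H by mu
# while leaving the eigenvectors unchanged, as in Eq. (12.29)-(12.30).
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # eigenvalues are 1 and 3
mu = 0.5
G = H + mu * np.eye(2)

eig_H = np.sort(np.linalg.eigvalsh(H))
eig_G = np.sort(np.linalg.eigvalsh(G))
print(eig_H)   # eigenvalues of H: 1 and 3
print(eig_G)   # eigenvalues of G: 1.5 and 3.5, i.e. shifted by mu
```

With a large enough mu the shifted eigenvalues are all positive, so G is positive definite and invertible even when H itself is singular.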

This leads to the Levenberg-Marquardt algorithm [Scal85]:

$$\Delta\mathbf{x}_k = -\left[\mathbf{J}^T(\mathbf{x}_k)\mathbf{J}(\mathbf{x}_k) + \mu_k\mathbf{I}\right]^{-1}\mathbf{J}^T(\mathbf{x}_k)\mathbf{v}(\mathbf{x}_k) , \qquad (12.31)$$

or

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \left[\mathbf{J}^T(\mathbf{x}_k)\mathbf{J}(\mathbf{x}_k) + \mu_k\mathbf{I}\right]^{-1}\mathbf{J}^T(\mathbf{x}_k)\mathbf{v}(\mathbf{x}_k) . \qquad (12.32)$$

This algorithm has the very useful feature that as $\mu_k$ is increased it approaches the steepest descent algorithm with small learning rate:

$$\mathbf{x}_{k+1} \cong \mathbf{x}_k - \frac{1}{\mu_k}\mathbf{J}^T(\mathbf{x}_k)\mathbf{v}(\mathbf{x}_k) = \mathbf{x}_k - \frac{1}{2\mu_k}\nabla F(\mathbf{x}) , \quad \text{for large } \mu_k , \qquad (12.33)$$

while as $\mu_k$ is decreased to zero the algorithm becomes Gauss-Newton.

The algorithm begins with $\mu_0$ set to some small value (e.g., $\mu_0 = 0.01$). If a step does not yield a smaller value for $F(\mathbf{x})$, then the step is repeated with $\mu_k$ multiplied by some factor $\vartheta > 1$ (e.g., $\vartheta = 10$). Eventually $F(\mathbf{x})$ should decrease, since we would be taking a small step in the direction of steepest descent. If a step does produce a smaller value for $F(\mathbf{x})$, then $\mu_k$ is divided by $\vartheta$ for the next step, so that the algorithm will approach Gauss-Newton, which should provide faster convergence. The algorithm provides a nice compromise between the speed of Newton's method and the guaranteed convergence of steepest descent.

Now let's see how we can apply the Levenberg-Marquardt algorithm to the multilayer network training problem. The performance index for multilayer network training is the mean squared error (see Eq. (11.11)). If each target occurs with equal probability, the mean squared error is proportional to the sum of squared errors over the $Q$ targets in the training set:
$$F(\mathbf{x}) = \sum_{q=1}^{Q}(\mathbf{t}_q - \mathbf{a}_q)^T(\mathbf{t}_q - \mathbf{a}_q) = \sum_{q=1}^{Q}\sum_{j=1}^{S^M}(e_{j,q})^2 = \sum_{i=1}^{N}(v_i)^2 , \qquad (12.34)$$

where $e_{j,q}$ is the $j$th element of the error for the $q$th input/target pair.

Eq. (12.34) is equivalent to the performance index, Eq. (12.20), for which Levenberg-Marquardt was designed. Therefore it should be a straightforward matter to adapt the algorithm for network training. It turns out that this is true in concept, but it does require some care in working out the details.


Jacobian Calculation

The key step in the Levenberg-Marquardt algorithm is the computation of the Jacobian matrix. To perform this computation we will use a variation of the backpropagation algorithm. Recall that in the standard backpropagation procedure we compute the derivatives of the squared errors with respect to the weights and biases of the network. To create the Jacobian matrix we need to compute the derivatives of the errors, instead of the derivatives of the squared errors.

It is a simple matter conceptually to modify the backpropagation algorithm to compute the elements of the Jacobian matrix. Unfortunately, although the basic concept is simple, the details of the implementation can be a little tricky. For that reason you may want to skim through the rest of this section on your first reading, in order to obtain an overview of the general flow of the presentation, and return later to pick up the details. It may also be helpful to review the development of the backpropagation algorithm in Chapter 11 before proceeding.

Before we present the procedure for computing the Jacobian, let's take a closer look at its form (Eq. (12.23)). Note that the error vector is

$$\mathbf{v}^T = \begin{bmatrix} v_1 & v_2 & \cdots & v_N \end{bmatrix} = \begin{bmatrix} e_{1,1} & e_{2,1} & \cdots & e_{S^M,1} & e_{1,2} & \cdots & e_{S^M,Q} \end{bmatrix} , \qquad (12.35)$$

the parameter vector is

$$\mathbf{x}^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} = \begin{bmatrix} w^1_{1,1} & w^1_{1,2} & \cdots & w^1_{S^1,R} & b^1_1 & \cdots & b^1_{S^1} & w^2_{1,1} & \cdots & b^M_{S^M} \end{bmatrix} , \qquad (12.36)$$

$N = Q \times S^M$ and $n = S^1(R+1) + S^2(S^1+1) + \cdots + S^M(S^{M-1}+1)$.

Therefore, if we make these substitutions into Eq. (12.23), the Jacobian matrix for multilayer network training can be written

$$\mathbf{J}(\mathbf{x}) = \begin{bmatrix}
\dfrac{\partial e_{1,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{1,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{1,1}}{\partial b^1_1} & \cdots \\[1ex]
\dfrac{\partial e_{2,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{2,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{2,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{2,1}}{\partial b^1_1} & \cdots \\
\vdots & \vdots & & \vdots & \vdots & \\
\dfrac{\partial e_{S^M,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{S^M,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{S^M,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{S^M,1}}{\partial b^1_1} & \cdots \\[1ex]
\dfrac{\partial e_{1,2}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,2}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{1,2}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{1,2}}{\partial b^1_1} & \cdots \\
\vdots & \vdots & & \vdots & \vdots &
\end{bmatrix} . \qquad (12.37)$$

The terms in this Jacobian matrix can be computed by a simple modification to the backpropagation algorithm. Standard backpropagation calculates terms like

$$\frac{\partial \hat F(\mathbf{x})}{\partial x_l} = \frac{\partial\, \mathbf{e}_q^T\mathbf{e}_q}{\partial x_l} . \qquad (12.38)$$

For the elements of the Jacobian matrix that are needed for the Levenberg-Marquardt algorithm we need to calculate terms like

$$[\mathbf{J}]_{h,l} = \frac{\partial v_h}{\partial x_l} = \frac{\partial e_{k,q}}{\partial x_l} . \qquad (12.39)$$

Recall from Eq. (11.18) in our derivation of backpropagation that

$$\frac{\partial \hat F}{\partial w^m_{i,j}} = \frac{\partial \hat F}{\partial n^m_i} \times \frac{\partial n^m_i}{\partial w^m_{i,j}} , \qquad (12.40)$$

where the first term on the right-hand side was defined as the sensitivity:

$$s^m_i \equiv \frac{\partial \hat F}{\partial n^m_i} . \qquad (12.41)$$

The backpropagation process computed the sensitivities through a recurrence relationship from the last layer backward to the first layer. We can use the same concept to compute the terms needed for the Jacobian matrix (Eq. (12.37)) if we define a new Marquardt sensitivity:

$$\tilde s^m_{i,h} \equiv \frac{\partial v_h}{\partial n^m_{i,q}} = \frac{\partial e_{k,q}}{\partial n^m_{i,q}} , \qquad (12.42)$$

where, from Eq. (12.35), $h = (q-1)S^M + k$.

Now we can compute elements of the Jacobian by

$$[\mathbf{J}]_{h,l} = \frac{\partial v_h}{\partial x_l} = \frac{\partial e_{k,q}}{\partial w^m_{i,j}} = \frac{\partial e_{k,q}}{\partial n^m_{i,q}} \times \frac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde s^m_{i,h} \times \frac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde s^m_{i,h} \times a^{m-1}_{j,q} , \qquad (12.43)$$

or if $x_l$ is a bias,

$$[\mathbf{J}]_{h,l} = \frac{\partial v_h}{\partial x_l} = \frac{\partial e_{k,q}}{\partial b^m_i} = \frac{\partial e_{k,q}}{\partial n^m_{i,q}} \times \frac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde s^m_{i,h} . \qquad (12.44)$$

The Marquardt sensitivities can be computed through the same recurrence relations as the standard sensitivities (Eq. (11.35)) with one modification at the final layer, which for standard backpropagation is computed with Eq. (11.40). For the Marquardt sensitivities at the final layer we have

$$\tilde s^M_{i,h} = \frac{\partial v_h}{\partial n^M_{i,q}} = \frac{\partial e_{k,q}}{\partial n^M_{i,q}} = \frac{\partial (t_{k,q} - a^M_{k,q})}{\partial n^M_{i,q}} = -\frac{\partial a^M_{k,q}}{\partial n^M_{i,q}} = \begin{cases} -\dot f^M(n^M_{i,q}) & \text{for } i = k \\ 0 & \text{for } i \neq k \end{cases} . \qquad (12.45)$$

Therefore when the input $\mathbf{p}_q$ has been applied to the network and the corresponding network output $\mathbf{a}^M_q$ has been computed, the Levenberg-Marquardt backpropagation is initialized with

$$\tilde{\mathbf{S}}^M_q = -\dot{\mathbf{F}}^M(\mathbf{n}^M_q) , \qquad (12.46)$$

where $\dot{\mathbf{F}}^M(\mathbf{n}^M)$ is defined in Eq. (11.34). Each column of the matrix $\tilde{\mathbf{S}}^M_q$ must be backpropagated through the network using Eq. (11.35) to produce one row of the Jacobian matrix. The columns can also be backpropagated together using

$$\tilde{\mathbf{S}}^m_q = \dot{\mathbf{F}}^m(\mathbf{n}^m_q)\,(\mathbf{W}^{m+1})^T\,\tilde{\mathbf{S}}^{m+1}_q . \qquad (12.47)$$

The total Marquardt sensitivity matrices for each layer are then created by augmenting the matrices computed for each input:

$$\tilde{\mathbf{S}}^m = \left[\; \tilde{\mathbf{S}}^m_1 \;\big|\; \tilde{\mathbf{S}}^m_2 \;\big|\; \cdots \;\big|\; \tilde{\mathbf{S}}^m_Q \;\right] . \qquad (12.48)$$

Note that for each input that is presented to the network we will backpropagate $S^M$ sensitivity vectors. This is because we are computing the derivatives of each individual error, rather than the derivative of the sum of squares of the errors. For every input applied to the network there will be $S^M$ errors (one for each element of the network output). For each error there will be one row of the Jacobian matrix.

After the sensitivities have been backpropagated, the Jacobian matrix is computed using Eq. (12.43) and Eq. (12.44). See Problem P12.5 for a numerical illustration of the Jacobian computation.

The iterations of the Levenberg-Marquardt backpropagation (LMBP) algorithm can be summarized as follows:

1. Present all inputs to the network and compute the corresponding network outputs (using Eq. (11.41) and Eq. (11.42)) and the errors $\mathbf{e}_q = \mathbf{t}_q - \mathbf{a}^M_q$. Compute the sum of squared errors over all inputs, $F(\mathbf{x})$, using Eq. (12.34).
2. Compute the Jacobian matrix, Eq. (12.37). Calculate the sensitivities with the recurrence relations Eq. (12.47), after initializing with Eq. (12.46). Augment the individual matrices into the Marquardt sensitivities using Eq. (12.48). Compute the elements of the Jacobian matrix with Eq. (12.43) and Eq. (12.44).
3. Solve Eq. (12.32) to obtain $\Delta\mathbf{x}_k$.
4. Recompute the sum of squared errors using $\mathbf{x}_k + \Delta\mathbf{x}_k$. If this new sum of squares is smaller than that computed in step 1, then divide $\mu$ by $\vartheta$, let $\mathbf{x}_{k+1} = \mathbf{x}_k + \Delta\mathbf{x}_k$ and go back to step 1. If the sum of squares is not reduced, then multiply $\mu$ by $\vartheta$ and go back to step 3.

The algorithm is assumed to have converged when the norm of the gradient, Eq. (12.22), is less than some predetermined value, or when the sum of squares has been reduced to some error goal.

To illustrate LMBP, let's apply it to the function approximation problem introduced at the beginning of this chapter. We will begin by looking at the basic Levenberg-Marquardt step. Figure 12.17 illustrates the possible steps the LMBP algorithm could take on the first iteration.
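For a single log-sigmoid neuron the four steps above collapse to a few lines. The sketch below is a toy example of our own (not from the text); the Jacobian rows use the analytic derivatives de/dw = -a(1-a)p and de/db = -a(1-a), which are the Marquardt sensitivity times the input (Eq. (12.43)) or times 1 (Eq. (12.44)):

```python
import math
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

# Minimal LMBP for the one-parameter-pair network a = logsig(w*p + b).
def lmbp_neuron(p, t, w=0.0, b=0.0, mu=0.01, theta=10.0, iters=100):
    x = np.array([w, b], dtype=float)
    p, t = np.asarray(p, float), np.asarray(t, float)
    def sse(x):
        a = np.array([logsig(x[0] * pq + x[1]) for pq in p])
        return float(np.sum((t - a) ** 2))
    F = sse(x)
    for _ in range(iters):
        a = np.array([logsig(x[0] * pq + x[1]) for pq in p])
        v = t - a                                   # step 1: errors
        s = -a * (1 - a)                            # Marquardt sensitivities
        J = np.column_stack([s * p, s])             # step 2: Jacobian rows
        dx = -np.linalg.solve(J.T @ J + mu * np.eye(2), J.T @ v)  # step 3
        if sse(x + dx) < F:                         # step 4: accept or reject
            x, F, mu = x + dx, sse(x + dx), mu / theta
        else:
            mu *= theta
    return x, F

# Recover w = 2, b = -1 from targets that the neuron can represent exactly.
ps = [-1.0, 0.0, 1.0, 2.0]
ts = [logsig(2.0 * pq - 1.0) for pq in ps]
x_fit, F_fit = lmbp_neuron(ps, ts)
```

Even for this tiny problem the accept/reject logic on mu is what makes the method robust: a rejected step costs only one extra function evaluation.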
Figure 12.17 Levenberg-Marquardt Step

The black arrow represents the direction taken for small $\mu_k$, which corresponds to the Gauss-Newton direction. The blue arrow represents the direction taken for large $\mu_k$, which corresponds to the steepest descent direction. (This was the initial direction taken by all of the previous algorithms discussed.) The blue curve represents the Levenberg-Marquardt step for all intermediate values of $\mu_k$. Note that as $\mu_k$ is increased the algorithm moves toward a small step in the direction of steepest descent. This guarantees that the algorithm will always be able to reduce the sum of squares at each iteration.

Figure 12.18 shows the path of the LMBP trajectory to convergence.

Figure 12.18 LMBP Trajectory

To experiment with the LMBP algorithm, use the Neural Network Design Demonstrations Marquardt Step (nnd12ms) and Marquardt Backpropagation (nnd12m).

The key drawback of the LMBP algorithm is the storage requirement. The algorithm must store the approximate Hessian matrix $\mathbf{J}^T\mathbf{J}$. This is an $n \times n$ matrix, where $n$ is the number of parameters (weights and biases) in the network. Recall that the other methods discussed need only store the gradient, which is an $n$-dimensional vector. When the number of parameters is very large, it may be impractical to use the Levenberg-Marquardt algorithm. (What constitutes "very large" depends on the available memory on your computer, but typically a few thousand parameters is an upper limit.)

Summary of Results
Heuristic Variations of Backpropagation
Batching
The parameters are updated only after the entire training set has been presented. The gradients calculated for each training example are averaged together to produce a more accurate estimate of the gradient. (If the training set is complete, i.e., covers all possible input/output pairs, then the gradient estimate will be exact.)

Backpropagation with Momentum (MOBP)

$$\Delta\mathbf{W}^m(k) = \gamma\,\Delta\mathbf{W}^m(k-1) - (1-\gamma)\,\alpha\,\mathbf{s}^m(\mathbf{a}^{m-1})^T$$

$$\Delta\mathbf{b}^m(k) = \gamma\,\Delta\mathbf{b}^m(k-1) - (1-\gamma)\,\alpha\,\mathbf{s}^m$$
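In code, the momentum update is a single low-pass filter applied to the gradient step; here `grad_W` stands for the gradient term $\mathbf{s}^m(\mathbf{a}^{m-1})^T$, and the names are our own:

```python
import numpy as np

# One MOBP update: dW(k) = gamma*dW(k-1) - (1-gamma)*alpha*grad.
# The same rule applies element-wise to the bias vectors.
def mobp_update(dW_prev, grad_W, alpha=0.1, gamma=0.8):
    return gamma * dW_prev - (1 - gamma) * alpha * grad_W

dW = np.zeros(2)
g = np.array([1.0, -2.0])
dW = mobp_update(dW, g)   # first step: a pure (scaled) gradient step
```

Because gamma multiplies the previous change, successive updates in a consistent direction accumulate, while oscillating gradient components are averaged out.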

Variable Learning Rate Backpropagation (VLBP)

1. If the squared error (over the entire training set) increases by more than some set percentage $\zeta$ (typically one to five percent) after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor $\rho < 1$, and the momentum coefficient $\gamma$ (if it is used) is set to zero.

2. If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor $\eta > 1$. If $\gamma$ has been previously set to zero, it is reset to its original value.

3. If the squared error increases by less than $\zeta$, then the weight update is accepted but the learning rate and the momentum coefficient are unchanged.
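The three rules can be sketched as a single step function. This is a scalar toy interface of our own for illustration; the text's algorithm operates on the full weight vector of the network:

```python
# One VLBP step. F is the batch error function, grad its gradient;
# eta > 1, rho < 1, and zeta is the allowed fractional increase.
def vlbp_step(x, dx_prev, F, grad, alpha, gamma, gamma0,
              eta=1.5, rho=0.5, zeta=0.05):
    dx = gamma * dx_prev - (1 - gamma) * alpha * grad(x)
    F_new = F(x + dx)
    if F_new > F(x) * (1 + zeta):
        # rule 1: discard the step, shrink alpha, zero the momentum
        return x, dx_prev, alpha * rho, 0.0
    if F_new < F(x):
        # rule 2: accept the step, grow alpha, restore momentum
        return x + dx, dx, alpha * eta, gamma0
    # rule 3: accept, but leave alpha and gamma unchanged
    return x + dx, dx, alpha, gamma

# One step on F(x) = x^2 from x = 1 with alpha = 0.05 and no momentum yet.
x1, dx1, a1, g1 = vlbp_step(1.0, 0.0, lambda z: z * z, lambda z: 2.0 * z,
                            alpha=0.05, gamma=0.0, gamma0=0.2)
```

Here the error decreases (0.81 < 1.0), so rule 2 fires: the step is accepted, the learning rate grows to 0.075, and the momentum coefficient is restored to its original value.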

Numerical Optimization Techniques

Conjugate Gradient

Interval Location

Interval Reduction (Golden Section Search)

τ = 0.618
Set c_1 = a_1 + (1-τ)(b_1 - a_1) , F_c = F(c_1)
    d_1 = b_1 - (1-τ)(b_1 - a_1) , F_d = F(d_1)

For k = 1, 2, ... repeat
    If F_c < F_d then
        Set a_{k+1} = a_k ; b_{k+1} = d_k ; d_{k+1} = c_k
            c_{k+1} = a_{k+1} + (1-τ)(b_{k+1} - a_{k+1})
            F_d = F_c ; F_c = F(c_{k+1})
    else
        Set a_{k+1} = c_k ; b_{k+1} = b_k ; c_{k+1} = d_k
            d_{k+1} = b_{k+1} - (1-τ)(b_{k+1} - a_{k+1})
            F_c = F_d ; F_d = F(d_{k+1})
    end
end until b_{k+1} - a_{k+1} < tol

Levenberg-Marquardt Backpropagation (LMBP)

$$\Delta\mathbf{x}_k = -\left[\mathbf{J}^T(\mathbf{x}_k)\mathbf{J}(\mathbf{x}_k) + \mu_k\mathbf{I}\right]^{-1}\mathbf{J}^T(\mathbf{x}_k)\mathbf{v}(\mathbf{x}_k)$$

$$\mathbf{v}^T = \begin{bmatrix} v_1 & v_2 & \cdots & v_N \end{bmatrix} = \begin{bmatrix} e_{1,1} & e_{2,1} & \cdots & e_{S^M,1} & e_{1,2} & \cdots & e_{S^M,Q} \end{bmatrix}$$

$$\mathbf{x}^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} = \begin{bmatrix} w^1_{1,1} & w^1_{1,2} & \cdots & w^1_{S^1,R} & b^1_1 & \cdots & b^1_{S^1} & w^2_{1,1} & \cdots & b^M_{S^M} \end{bmatrix}$$

$$N = Q \times S^M \quad\text{and}\quad n = S^1(R+1) + S^2(S^1+1) + \cdots + S^M(S^{M-1}+1)$$

$$\mathbf{J}(\mathbf{x}) = \begin{bmatrix}
\dfrac{\partial e_{1,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{1,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{1,1}}{\partial b^1_1} & \cdots \\[1ex]
\dfrac{\partial e_{2,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{2,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{2,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{2,1}}{\partial b^1_1} & \cdots \\
\vdots & \vdots & & \vdots & \vdots &
\end{bmatrix}$$

$$[\mathbf{J}]_{h,l} = \frac{\partial v_h}{\partial x_l} = \frac{\partial e_{k,q}}{\partial w^m_{i,j}} = \tilde s^m_{i,h} \times a^{m-1}_{j,q} \quad\text{for weight } x_l$$

$$[\mathbf{J}]_{h,l} = \frac{\partial v_h}{\partial x_l} = \frac{\partial e_{k,q}}{\partial b^m_i} = \tilde s^m_{i,h} \quad\text{for bias } x_l$$

$$\tilde s^m_{i,h} \equiv \frac{\partial v_h}{\partial n^m_{i,q}} \quad\text{(Marquardt sensitivity), where } h = (q-1)S^M + k$$

$$\tilde{\mathbf{S}}^M_q = -\dot{\mathbf{F}}^M(\mathbf{n}^M_q) , \qquad \tilde{\mathbf{S}}^m_q = \dot{\mathbf{F}}^m(\mathbf{n}^m_q)\,(\mathbf{W}^{m+1})^T\,\tilde{\mathbf{S}}^{m+1}_q$$

$$\tilde{\mathbf{S}}^m = \left[\; \tilde{\mathbf{S}}^m_1 \;\big|\; \tilde{\mathbf{S}}^m_2 \;\big|\; \cdots \;\big|\; \tilde{\mathbf{S}}^m_Q \;\right]$$

Levenberg-Marquardt Iterations

1. Present all inputs to the network and compute the corresponding network outputs (using Eq. (11.41) and Eq. (11.42)) and the errors $\mathbf{e}_q = \mathbf{t}_q - \mathbf{a}^M_q$. Compute the sum of squared errors over all inputs, $F(\mathbf{x})$, using Eq. (12.34).
2. Compute the Jacobian matrix, Eq. (12.37). Calculate the sensitivities with the recurrence relations Eq. (12.47), after initializing with Eq. (12.46). Augment the individual matrices into the Marquardt sensitivities using Eq. (12.48). Compute the elements of the Jacobian matrix with Eq. (12.43) and Eq. (12.44).
3. Solve Eq. (12.32) to obtain $\Delta\mathbf{x}_k$.
4. Recompute the sum of squared errors using $\mathbf{x}_k + \Delta\mathbf{x}_k$. If this new sum of squares is smaller than that computed in step 1, then divide $\mu$ by $\vartheta$, let $\mathbf{x}_{k+1} = \mathbf{x}_k + \Delta\mathbf{x}_k$ and go back to step 1. If the sum of squares is not reduced, then multiply $\mu$ by $\vartheta$ and go back to step 3.

Solved Problems

P12.1 We want to train the network shown in Figure P12.1 on the training set

$$\{(p_1 = -3), (t_1 = 0.5)\} , \quad \{(p_2 = 2), (t_2 = 1)\} ,$$

starting from the initial guess

$$w(0) = 0.4 , \quad b(0) = 0.15 .$$

Demonstrate the effect of batching by computing the direction of the initial step for SDBP with and without batching.
Input - Log-Sigmoid Layer: a = logsig(wp + b)

Figure P12.1 Network for Problem P12.1

Let's begin by computing the direction of the initial step if batching is not used. In this case the first step is computed from the first input/target pair. The forward and backpropagation steps are

$$a = \text{logsig}(wp + b) = \frac{1}{1 + \exp(-(0.4(-3) + 0.15))} = 0.2592$$

$$e = t - a = 0.5 - 0.2592 = 0.2408$$

$$s = -2\dot f(n)\,e = -2a(1-a)e = -2(0.2592)(1 - 0.2592)(0.2408) = -0.0925 .$$

The direction of the initial step is the negative of the gradient. For the weight this will be

$$-s\,p = -(-0.0925)(-3) = -0.2774 .$$

For the bias we have

$$-s = -(-0.0925) = 0.0925 .$$

Therefore the direction of the initial step in the $(w, b)$ plane would be

$$\begin{bmatrix} -0.2774 \\ 0.0925 \end{bmatrix} .$$

Now let's consider the initial direction for the batch mode algorithm. In this case the gradient is found by adding together the individual gradients found from the two sets of input/target pairs. For this we need to apply the second input to the network and perform the forward and backpropagation steps:

$$a = \text{logsig}(wp + b) = \frac{1}{1 + \exp(-(0.4(2) + 0.15))} = 0.7211$$

$$e = t - a = 1 - 0.7211 = 0.2789$$

$$s = -2\dot f(n)\,e = -2a(1-a)e = -2(0.7211)(1 - 0.7211)(0.2789) = -0.1122 .$$

The direction of the step is the negative of the gradient. For the weight this will be

$$-s\,p = -(-0.1122)(2) = 0.2243 .$$

For the bias we have

$$-s = -(-0.1122) = 0.1122 .$$

The partial gradient for the second input/target pair is therefore

$$\begin{bmatrix} 0.2243 \\ 0.1122 \end{bmatrix} .$$

If we now add the results from the two input/target pairs we find the direction of the first step of the batch mode SDBP to be

$$\frac{1}{2}\left( \begin{bmatrix} -0.2774 \\ 0.0925 \end{bmatrix} + \begin{bmatrix} 0.2243 \\ 0.1122 \end{bmatrix} \right) = \frac{1}{2}\begin{bmatrix} -0.0531 \\ 0.2047 \end{bmatrix} = \begin{bmatrix} -0.0265 \\ 0.1023 \end{bmatrix} .$$

The results are illustrated in Figure P12.2. The blue circle indicates the initial guess. The two blue arrows represent the directions of the partial gradients for each of the two input/target pairs, and the black arrow represents the direction of the total gradient. The function that is plotted is the sum of squared errors for the entire training set. Note that the individual partial gradients can point in quite different directions than the true gradient. However, on the average, over several iterations, the path will generally follow the steepest descent trajectory.

The relative effectiveness of the batch mode over the incremental approach depends very much on the particular problem. The incremental approach requires less storage, and, if the inputs are presented randomly to the network, the trajectory is stochastic, which makes the algorithm somewhat less likely to be trapped in a local minimum. It may also take longer to converge than the batch mode algorithm.
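The numbers in P12.1 can be reproduced directly (a Python sketch of the single-neuron forward and backpropagation steps):

```python
import math

# Per-pair negative gradient direction for the log-sigmoid neuron of P12.1,
# with w = 0.4 and b = 0.15. Returns (-s*p, -s), the (w, b) components.
def step_direction(p, t, w=0.4, b=0.15):
    a = 1.0 / (1.0 + math.exp(-(w * p + b)))
    e = t - a
    s = -2 * a * (1 - a) * e
    return (-s * p, -s)

d1 = step_direction(-3, 0.5)   # first pair:  approx (-0.2774, 0.0925)
d2 = step_direction(2, 1.0)    # second pair: approx ( 0.2243, 0.1122)
batch = ((d1[0] + d2[0]) / 2, (d1[1] + d2[1]) / 2)   # approx (-0.0265, 0.1023)
```

The batch direction is the average of the two partial directions, matching the final vector computed above.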


Figure P12.2 Effect of Batching in Problem P12.1

P12.2 In Chapter 9 we proved that the steepest descent algorithm, when applied to a quadratic function, would be stable if the learning rate was less than 2 divided by the maximum eigenvalue of the Hessian matrix. Show that if a momentum term is added to the steepest descent algorithm there will always be a momentum coefficient that will make the algorithm stable, regardless of the learning rate. Follow the format of the proof on page 9-8.
The standard steepest descent algorithm is

$$\Delta\mathbf{x}_k = \mathbf{x}_{k+1} - \mathbf{x}_k = -\alpha\nabla F(\mathbf{x}_k) = -\alpha\mathbf{g}_k .$$

If we add momentum this becomes

$$\Delta\mathbf{x}_k = \gamma\Delta\mathbf{x}_{k-1} - (1-\gamma)\alpha\mathbf{g}_k .$$

Recall from Chapter 8 that the quadratic function has the form

$$F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{d}^T\mathbf{x} + c ,$$

and the gradient of the quadratic function is

$$\nabla F(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{d} .$$

If we now insert this expression into our expression for the steepest descent algorithm with momentum we obtain

$$\Delta\mathbf{x}_k = \gamma\Delta\mathbf{x}_{k-1} - (1-\gamma)\alpha(\mathbf{A}\mathbf{x}_k + \mathbf{d}) .$$

Using the definition $\Delta\mathbf{x}_k = \mathbf{x}_{k+1} - \mathbf{x}_k$, this can be rewritten

$$\mathbf{x}_{k+1} - \mathbf{x}_k = \gamma(\mathbf{x}_k - \mathbf{x}_{k-1}) - (1-\gamma)\alpha(\mathbf{A}\mathbf{x}_k + \mathbf{d})$$

or

$$\mathbf{x}_{k+1} = [(1+\gamma)\mathbf{I} - (1-\gamma)\alpha\mathbf{A}]\mathbf{x}_k - \gamma\mathbf{x}_{k-1} - (1-\gamma)\alpha\mathbf{d} .$$

Now define a new vector

$$\tilde{\mathbf{x}}_k = \begin{bmatrix} \mathbf{x}_{k-1} \\ \mathbf{x}_k \end{bmatrix} .$$

The momentum variation of steepest descent can then be written

$$\tilde{\mathbf{x}}_{k+1} = \begin{bmatrix} \mathbf{0} & \mathbf{I} \\ -\gamma\mathbf{I} & [(1+\gamma)\mathbf{I} - (1-\gamma)\alpha\mathbf{A}] \end{bmatrix}\tilde{\mathbf{x}}_k + \begin{bmatrix} \mathbf{0} \\ -(1-\gamma)\alpha\mathbf{d} \end{bmatrix} = \mathbf{W}\tilde{\mathbf{x}}_k + \mathbf{v} .$$

This is a linear dynamic system that will be stable if the eigenvalues of $\mathbf{W}$ are less than one in magnitude. We will find the eigenvalues of $\mathbf{W}$ in stages. First, rewrite $\mathbf{W}$ as

$$\mathbf{W} = \begin{bmatrix} \mathbf{0} & \mathbf{I} \\ -\gamma\mathbf{I} & \mathbf{T} \end{bmatrix} , \quad\text{where}\quad \mathbf{T} = [(1+\gamma)\mathbf{I} - (1-\gamma)\alpha\mathbf{A}] .$$

The eigenvalues and eigenvectors of $\mathbf{W}$ should satisfy

$$\mathbf{W}\mathbf{z}^w = \lambda^w\mathbf{z}^w , \quad\text{or}\quad \begin{bmatrix} \mathbf{0} & \mathbf{I} \\ -\gamma\mathbf{I} & \mathbf{T} \end{bmatrix}\begin{bmatrix} \mathbf{z}^w_1 \\ \mathbf{z}^w_2 \end{bmatrix} = \lambda^w\begin{bmatrix} \mathbf{z}^w_1 \\ \mathbf{z}^w_2 \end{bmatrix} .$$

This means that

$$\mathbf{z}^w_2 = \lambda^w\mathbf{z}^w_1 \quad\text{and}\quad -\gamma\mathbf{z}^w_1 + \mathbf{T}\mathbf{z}^w_2 = \lambda^w\mathbf{z}^w_2 .$$

At this point we will choose $\mathbf{z}^w_1$ to be an eigenvector of the matrix $\mathbf{T}$, with corresponding eigenvalue $\lambda^t$. (If this choice is not appropriate it will lead to a contradiction.) Therefore the previous equations become

$$\mathbf{z}^w_2 = \lambda^w\mathbf{z}^w_1 \quad\text{and}\quad -\gamma\mathbf{z}^w_1 + \lambda^t\mathbf{z}^w_2 = \lambda^w\mathbf{z}^w_2 .$$

If we substitute the first equation into the second equation we find

$$-\gamma\mathbf{z}^w_1 + \lambda^t\lambda^w\mathbf{z}^w_1 = (\lambda^w)^2\mathbf{z}^w_1 , \quad\text{or}\quad \left[(\lambda^w)^2 - \lambda^t\lambda^w + \gamma\right]\mathbf{z}^w_1 = \mathbf{0} .$$
Therefore for each eigenvalue $\lambda^t$ of $\mathbf{T}$ there will be two eigenvalues $\lambda^w$ of $\mathbf{W}$ that are roots of the quadratic equation

$$(\lambda^w)^2 - \lambda^t\lambda^w + \gamma = 0 .$$

From the quadratic formula we have

$$\lambda^w = \frac{\lambda^t \pm \sqrt{(\lambda^t)^2 - 4\gamma}}{2} .$$

For the algorithm to be stable the magnitude of each eigenvalue must be less than 1. We will show that there always exists some range of $\gamma$ for which this is true. Note that if the eigenvalues $\lambda^w$ are complex then their magnitude will be $\sqrt{\gamma}$:

$$|\lambda^w| = \sqrt{\left(\frac{\lambda^t}{2}\right)^2 + \frac{4\gamma - (\lambda^t)^2}{4}} = \sqrt{\gamma} .$$

(This is true only for real $\lambda^t$. We will show later that $\lambda^t$ is real.) Since $\gamma$ is between 0 and 1, the magnitude of the eigenvalue must be less than 1. It remains to show that there exists some range of $\gamma$ for which all of the eigenvalues are complex.

In order for $\lambda^w$ to be complex we must have

$$(\lambda^t)^2 - 4\gamma < 0 \quad\text{or}\quad |\lambda^t| < 2\sqrt{\gamma} .$$

Let's now consider the eigenvalues $\lambda^t$ of $\mathbf{T}$. These eigenvalues can be expressed in terms of the eigenvalues of $\mathbf{A}$. Let $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ and $\{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n\}$ be the eigenvalues and eigenvectors of the Hessian matrix. Then

$$\mathbf{T}\mathbf{z}_i = [(1+\gamma)\mathbf{I} - (1-\gamma)\alpha\mathbf{A}]\mathbf{z}_i = (1+\gamma)\mathbf{z}_i - (1-\gamma)\alpha\mathbf{A}\mathbf{z}_i = (1+\gamma)\mathbf{z}_i - (1-\gamma)\alpha\lambda_i\mathbf{z}_i = \{(1+\gamma) - (1-\gamma)\alpha\lambda_i\}\mathbf{z}_i .$$

Therefore the eigenvectors of $\mathbf{T}$ are the same as the eigenvectors of $\mathbf{A}$, and the eigenvalues of $\mathbf{T}$ are

$$\lambda^t_i = (1+\gamma) - (1-\gamma)\alpha\lambda_i .$$

(Note that $\lambda^t_i$ is real, since $\gamma$, $\alpha$ and the $\lambda_i$ for symmetric $\mathbf{A}$ are real.) Therefore, in order for $\lambda^w$ to be complex we must have

$$|(1+\gamma) - (1-\gamma)\alpha\lambda_i| < 2\sqrt{\gamma} .$$

For $\gamma = 1$ both sides of the inequality will equal 2. The function on the right of the inequality, as a function of $\gamma$, has a slope of 1 at $\gamma = 1$. The function on the left of the inequality has a slope of $1 + \alpha\lambda_i$. Since the eigenvalues of the Hessian will be positive real numbers if the function has a strong minimum, and the learning rate is a positive number, this slope must be greater than 1. This shows that the inequality will always hold for $\gamma$ close enough to 1.

To summarize the results, we have shown that if a momentum term is added to the steepest descent algorithm on a quadratic function, then there will always be a momentum coefficient that will make the algorithm stable, regardless of the learning rate. In addition we have shown that if $\gamma$ is close enough to 1, then the magnitudes of the eigenvalues of $\mathbf{W}$ will be $\sqrt{\gamma}$. It can be shown (see [Brog91]) that the magnitudes of the eigenvalues determine how fast the algorithm will converge. The smaller the magnitude, the faster the convergence. As the magnitude approaches 1, the convergence time increases.

We can demonstrate these results using the example on page 9-7. There we showed that the steepest descent algorithm, when applied to the function $F(\mathbf{x}) = x_1^2 + 25x_2^2$, was unstable for a learning rate $\alpha \geq 0.04$. In Figure P12.3 we see the steepest descent trajectory (with momentum) with $\alpha = 0.041$ and $\gamma = 0.2$. Compare this trajectory with Figure 9.3, which uses the same learning rate but no momentum.

Figure P12.3 Trajectory for α = 0.041 and γ = 0.2
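The stability conclusion can be checked numerically for this example (α = 0.041, γ = 0.2, Hessian A = diag(2, 50)) by building the matrix W from the proof and computing its eigenvalues:

```python
import numpy as np

# W = [[0, I], [-gamma*I, T]] with T = (1+gamma)I - (1-gamma)*alpha*A.
# Stability requires all eigenvalues of W to have magnitude less than 1.
alpha, gamma = 0.041, 0.2
A = np.diag([2.0, 50.0])
I = np.eye(2)
T = (1 + gamma) * I - (1 - gamma) * alpha * A
W = np.block([[np.zeros((2, 2)), I],
              [-gamma * I, T]])
mags = np.abs(np.linalg.eigvals(W))
print(mags.max() < 1.0)   # True: momentum stabilizes this learning rate
```

Without momentum this learning rate exceeds the steepest descent stability limit 2/λ_max = 0.04, so the same check with γ = 0 would fail.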

P12.3 Execute three iterations of the variable learning rate algorithm on the following function (from the Chapter 9 example on page 9-7)

$$F(\mathbf{x}) = x_1^2 + 25x_2^2 ,$$

starting from the initial guess

$$\mathbf{x}_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} ,$$

and use the following values for the algorithm parameters:

$$\alpha = 0.05 , \quad \gamma = 0.2 , \quad \eta = 1.5 , \quad \rho = 0.5 , \quad \zeta = 5\% .$$

The first step is to evaluate the function at the initial guess:

$$F(\mathbf{x}_0) = \frac{1}{2}\begin{bmatrix} 0.5 & 0.5 \end{bmatrix}\begin{bmatrix} 2 & 0 \\ 0 & 50 \end{bmatrix}\begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} = 6.5 .$$

The next step is to find the gradient:

$$\nabla F(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial F}{\partial x_1} \\[1ex] \dfrac{\partial F}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1 \\ 50x_2 \end{bmatrix} .$$

If we evaluate the gradient at the initial guess we find:

$$\mathbf{g}_0 = \nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0} = \begin{bmatrix} 1 \\ 25 \end{bmatrix} .$$

With the initial learning rate of $\alpha = 0.05$, the tentative first step of the algorithm is

$$\Delta\mathbf{x}_0 = \gamma\Delta\mathbf{x}_{-1} - (1-\gamma)\alpha\mathbf{g}_0 = -0.8(0.05)\begin{bmatrix} 1 \\ 25 \end{bmatrix} = \begin{bmatrix} -0.04 \\ -1 \end{bmatrix}$$

$$\mathbf{x}_1 = \mathbf{x}_0 + \Delta\mathbf{x}_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} + \begin{bmatrix} -0.04 \\ -1 \end{bmatrix} = \begin{bmatrix} 0.46 \\ -0.5 \end{bmatrix} .$$

To verify that this is a valid step we must test the value of the function at this new point:

$$F(\mathbf{x}_1) = \frac{1}{2}\begin{bmatrix} 0.46 & -0.5 \end{bmatrix}\begin{bmatrix} 2 & 0 \\ 0 & 50 \end{bmatrix}\begin{bmatrix} 0.46 \\ -0.5 \end{bmatrix} = 6.4616 .$$

This is less than $F(\mathbf{x}_0)$. Therefore this tentative step is accepted and the learning rate is increased:

$$\alpha = \eta\alpha = 1.5(0.05) = 0.075 .$$

The tentative second step of the algorithm is

$$\Delta\mathbf{x}_1 = \gamma\Delta\mathbf{x}_0 - (1-\gamma)\alpha\mathbf{g}_1 = 0.2\begin{bmatrix} -0.04 \\ -1 \end{bmatrix} - 0.8(0.075)\begin{bmatrix} 0.92 \\ -25 \end{bmatrix} = \begin{bmatrix} -0.0632 \\ 1.3 \end{bmatrix}$$

$$\mathbf{x}_2 = \mathbf{x}_1 + \Delta\mathbf{x}_1 = \begin{bmatrix} 0.46 \\ -0.5 \end{bmatrix} + \begin{bmatrix} -0.0632 \\ 1.3 \end{bmatrix} = \begin{bmatrix} 0.3968 \\ 0.8 \end{bmatrix} .$$

We evaluate the function at this point:

$$F(\mathbf{x}_2) = \frac{1}{2}\begin{bmatrix} 0.3968 & 0.8 \end{bmatrix}\begin{bmatrix} 2 & 0 \\ 0 & 50 \end{bmatrix}\begin{bmatrix} 0.3968 \\ 0.8 \end{bmatrix} = 16.157 .$$

Since this is more than 5% larger than $F(\mathbf{x}_1)$, we reject this step, reduce the learning rate and set the momentum coefficient to zero:

$$\mathbf{x}_2 = \mathbf{x}_1 , \quad F(\mathbf{x}_2) = F(\mathbf{x}_1) = 6.4616 , \quad \alpha = \rho\alpha = 0.5(0.075) = 0.0375 , \quad \gamma = 0 .$$

Now a new tentative step is computed (momentum is zero):

$$\Delta\mathbf{x}_2 = -\alpha\mathbf{g}_2 = -(0.0375)\begin{bmatrix} 0.92 \\ -25 \end{bmatrix} = \begin{bmatrix} -0.0345 \\ 0.9375 \end{bmatrix}$$

$$\mathbf{x}_3 = \mathbf{x}_2 + \Delta\mathbf{x}_2 = \begin{bmatrix} 0.46 \\ -0.5 \end{bmatrix} + \begin{bmatrix} -0.0345 \\ 0.9375 \end{bmatrix} = \begin{bmatrix} 0.4255 \\ 0.4375 \end{bmatrix}$$

$$F(\mathbf{x}_3) = \frac{1}{2}\begin{bmatrix} 0.4255 & 0.4375 \end{bmatrix}\begin{bmatrix} 2 & 0 \\ 0 & 50 \end{bmatrix}\begin{bmatrix} 0.4255 \\ 0.4375 \end{bmatrix} = 4.966 .$$

This is less than $F(\mathbf{x}_2)$. Therefore this step is accepted, the momentum is reset to its original value, and the learning rate is increased:

$$\gamma = 0.2 , \quad \alpha = \eta\alpha = 1.5(0.0375) = 0.05625 .$$

This completes the third iteration.

P12.4 Recall the example from Chapter 9 that we used to demonstrate the conjugate gradient algorithm (page 9-18):

$$F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\mathbf{x} ,$$

with initial guess

$$\mathbf{x}_0 = \begin{bmatrix} 0.8 \\ -0.25 \end{bmatrix} .$$

Perform one iteration of the conjugate gradient algorithm. For the linear minimization use interval location by function evaluation and interval reduction by the Golden Section search.

The gradient of this function is
$$\nabla F(\mathbf{x}) = \begin{bmatrix} 2x_1 + x_2 \\ x_1 + 2x_2 \end{bmatrix} .$$

As with steepest descent, the first search direction for the conjugate gradient algorithm is the negative of the gradient:

$$\mathbf{p}_0 = -\mathbf{g}_0 = -\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0} = \begin{bmatrix} -1.35 \\ -0.3 \end{bmatrix} .$$

For the first iteration we need to minimize $F(\mathbf{x})$ along the line

$$\mathbf{x}_0 + \alpha_0\mathbf{p}_0 = \begin{bmatrix} 0.8 \\ -0.25 \end{bmatrix} + \alpha_0\begin{bmatrix} -1.35 \\ -0.3 \end{bmatrix} .$$

The first step is interval location. Assume that the initial step size is $\varepsilon = 0.075$. Then the interval location would proceed as follows:

$$F\left(\begin{bmatrix} 0.8 \\ -0.25 \end{bmatrix}\right) = 0.5025 ,$$

$$b_1 = \varepsilon = 0.075 , \quad F(b_1) = F\left(\begin{bmatrix} 0.8 \\ -0.25 \end{bmatrix} + 0.075\begin{bmatrix} -1.35 \\ -0.3 \end{bmatrix}\right) = 0.3721 ,$$

$$b_2 = 2\varepsilon = 0.15 , \quad F(b_2) = 0.2678 ,$$

$$b_3 = 4\varepsilon = 0.3 , \quad F(b_3) = 0.1373 ,$$

$$b_4 = 8\varepsilon = 0.6 , \quad F(b_4) = 0.1893 .$$

Since the function increases between two consecutive evaluations, we know that the minimum must occur in the interval [0.15, 0.6]. This process is illustrated by the open blue circles in Figure P12.4, and the final interval is indicated by the large open black circles.

The next step in the linear minimization is interval reduction using the Golden Section search. This proceeds as follows:

$$c_1 = a_1 + (1-\tau)(b_1 - a_1) = 0.15 + (0.382)(0.6 - 0.15) = 0.3219 ,$$

$$d_1 = b_1 - (1-\tau)(b_1 - a_1) = 0.6 - (0.382)(0.6 - 0.15) = 0.4281 ,$$

$$F_a = 0.2678 , \quad F_b = 0.1893 , \quad F_c = 0.1270 , \quad F_d = 0.1085 .$$

Since $F_c > F_d$, we have

$$a_2 = c_1 = 0.3219 , \quad b_2 = b_1 = 0.6 , \quad c_2 = d_1 = 0.4281 ,$$

$$d_2 = b_2 - (1-\tau)(b_2 - a_2) = 0.6 - (0.382)(0.6 - 0.3219) = 0.4938 ,$$

$$F_c = F_d = 0.1085 , \quad F_d = F(d_2) = 0.1232 .$$

This time $F_c < F_d$, therefore

$$a_3 = a_2 = 0.3219 , \quad b_3 = d_2 = 0.4938 , \quad d_3 = c_2 = 0.4281 ,$$

$$c_3 = a_3 + (1-\tau)(b_3 - a_3) = 0.3219 + (0.382)(0.4938 - 0.3219) = 0.3876 ,$$

$$F_d = F_c = 0.1085 , \quad F_c = F(c_3) = 0.1094 .$$

This routine continues until $b_{k+1} - a_{k+1} <$ tol. The black dots in Figure P12.4 indicate the location of the new interior points, one for each iteration of the procedure. The final point is indicated by a blue dot. Compare this result with the first iteration shown in Figure 9.10.

Figure P12.4 Linear Minimization Example

P12.5 To illustrate the computation of the Jacobian matrix for the Levenberg-Marquardt method, consider using the network of Figure P12.5 for function approximation. The network transfer functions are chosen to be
$$f^1(n) = (n)^2 , \quad f^2(n) = n .$$

Therefore their derivatives are

$$\dot f^1(n) = 2n , \quad \dot f^2(n) = 1 .$$

Assume that the training set consists of

$$\{(p_1 = 1), (t_1 = 1)\} , \quad \{(p_2 = 2), (t_2 = 2)\} ,$$

and that the parameters are initialized to

$$\mathbf{W}^1 = [1] , \quad \mathbf{b}^1 = [0] , \quad \mathbf{W}^2 = [2] , \quad \mathbf{b}^2 = [1] .$$

Find the Jacobian matrix for the first step of the Levenberg-Marquardt method.

Input - Layer 1 - Layer 2: $\mathbf{a}^1 = f^1(\mathbf{W}^1\mathbf{p} + \mathbf{b}^1)$, $\mathbf{a}^2 = f^2(\mathbf{W}^2\mathbf{a}^1 + \mathbf{b}^2)$

Figure P12.5 Two-Layer Network for LMBP Demonstration


The first step is to propagate the inputs through the network and compute the errors. For the first input:

$$\mathbf{a}^0 = \mathbf{p}_1 = [1]$$

$$\mathbf{n}^1 = \mathbf{W}^1\mathbf{a}^0 + \mathbf{b}^1 = [1][1] + [0] = [1] , \quad \mathbf{a}^1 = f^1(\mathbf{n}^1) = ([1])^2 = [1]$$

$$\mathbf{n}^2 = \mathbf{W}^2\mathbf{a}^1 + \mathbf{b}^2 = [2][1] + [1] = [3] , \quad \mathbf{a}^2 = f^2(\mathbf{n}^2) = [3]$$

$$\mathbf{e}_1 = \mathbf{t}_1 - \mathbf{a}^2 = [1] - [3] = [-2] .$$

For the second input:

$$\mathbf{a}^0 = \mathbf{p}_2 = [2]$$

$$\mathbf{n}^1 = [1][2] + [0] = [2] , \quad \mathbf{a}^1 = ([2])^2 = [4]$$

$$\mathbf{n}^2 = [2][4] + [1] = [9] , \quad \mathbf{a}^2 = [9]$$

$$\mathbf{e}_2 = \mathbf{t}_2 - \mathbf{a}^2 = [2] - [9] = [-7] .$$

The next step is to initialize and backpropagate the Marquardt sensitivities using Eq. (12.46) and Eq. (12.47). For the first input:

$$\tilde{\mathbf{S}}^2_1 = -\dot{\mathbf{F}}^2(\mathbf{n}^2) = [-1]$$

$$\tilde{\mathbf{S}}^1_1 = \dot{\mathbf{F}}^1(\mathbf{n}^1)\,(\mathbf{W}^2)^T\,\tilde{\mathbf{S}}^2_1 = [2n^1_1][2][-1] = [2(1)][2][-1] = [-4] .$$

For the second input:

$$\tilde{\mathbf{S}}^2_2 = -\dot{\mathbf{F}}^2(\mathbf{n}^2) = [-1]$$

$$\tilde{\mathbf{S}}^1_2 = [2(2)][2][-1] = [-8] .$$

The augmented Marquardt sensitivity matrices are then

$$\tilde{\mathbf{S}}^1 = \begin{bmatrix} -4 & -8 \end{bmatrix} , \qquad \tilde{\mathbf{S}}^2 = \begin{bmatrix} -1 & -1 \end{bmatrix} .$$

We can now compute the Jacobian matrix using Eq. (12.43), Eq. (12.44) and Eq. (12.37):

$$\mathbf{J}(\mathbf{x}) = \begin{bmatrix}
\dfrac{\partial e_{1,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,1}}{\partial b^1_1} & \dfrac{\partial e_{1,1}}{\partial w^2_{1,1}} & \dfrac{\partial e_{1,1}}{\partial b^2_1} \\[1ex]
\dfrac{\partial e_{1,2}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,2}}{\partial b^1_1} & \dfrac{\partial e_{1,2}}{\partial w^2_{1,1}} & \dfrac{\partial e_{1,2}}{\partial b^2_1}
\end{bmatrix} .$$

The terms of the first row are

$$[\mathbf{J}]_{1,1} = \frac{\partial e_{1,1}}{\partial w^1_{1,1}} = \frac{\partial e_{1,1}}{\partial n^1_{1,1}} \times \frac{\partial n^1_{1,1}}{\partial w^1_{1,1}} = \tilde s^1_{1,1} \times a^0_{1,1} = (-4)(1) = -4$$

$$[\mathbf{J}]_{1,2} = \frac{\partial e_{1,1}}{\partial b^1_1} = \tilde s^1_{1,1} = -4$$

$$[\mathbf{J}]_{1,3} = \frac{\partial e_{1,1}}{\partial w^2_{1,1}} = \tilde s^2_{1,1} \times a^1_{1,1} = (-1)(1) = -1$$

$$[\mathbf{J}]_{1,4} = \frac{\partial e_{1,1}}{\partial b^2_1} = \tilde s^2_{1,1} = -1 ,$$

and the terms of the second row are

$$[\mathbf{J}]_{2,1} = \frac{\partial e_{1,2}}{\partial w^1_{1,1}} = \tilde s^1_{1,2} \times a^0_{1,2} = (-8)(2) = -16$$

$$[\mathbf{J}]_{2,2} = \frac{\partial e_{1,2}}{\partial b^1_1} = \tilde s^1_{1,2} = -8$$

$$[\mathbf{J}]_{2,3} = \frac{\partial e_{1,2}}{\partial w^2_{1,1}} = \tilde s^2_{1,2} \times a^1_{1,2} = (-1)(4) = -4$$

$$[\mathbf{J}]_{2,4} = \frac{\partial e_{1,2}}{\partial b^2_1} = \tilde s^2_{1,2} = -1 .$$

Therefore the Jacobian matrix is

$$\mathbf{J}(\mathbf{x}) = \begin{bmatrix} -4 & -4 & -1 & -1 \\ -16 & -8 & -4 & -1 \end{bmatrix} .$$
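This Jacobian can be verified by central finite differences on the network function. The sketch below re-implements the two-layer network algebraically (`errors` is our own helper name for this check):

```python
import numpy as np

# The P12.5 network: a2 = w2*(w1*p + b1)**2 + b2, with errors e = t - a2.
# Parameters are ordered x = [w1, b1, w2, b2], matching the Jacobian columns.
def errors(x, ps=(1.0, 2.0), ts=(1.0, 2.0)):
    w1, b1, w2, b2 = x
    return np.array([t - (w2 * (w1 * p + b1) ** 2 + b2)
                     for p, t in zip(ps, ts)])

x0 = np.array([1.0, 0.0, 2.0, 1.0])   # the initial parameters of P12.5
eps = 1e-6
J = np.zeros((2, 4))
for j in range(4):
    dx = np.zeros(4)
    dx[j] = eps
    J[:, j] = (errors(x0 + dx) - errors(x0 - dx)) / (2 * eps)
# J matches [[-4, -4, -1, -1], [-16, -8, -4, -1]] from the backpropagation
# computation above, up to finite-difference rounding.
```

This kind of finite-difference check is a useful sanity test whenever the Marquardt-sensitivity backpropagation is implemented for a new network architecture.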

One of the major problems with the basic backpropagation algorithm (steepest descent backpropagation - SDBP) has been the long training times. It is not feasible to use ST}BP on practical problems, because it can take weeks to train a network, even on a large computer. Since backpropagation was first popularized, there has been considerable work on methods to accelerate the convergence of the algorithm. In this chapter we have discussed the reasons for the slow convergence of SDBP and have presented several tecimiques for improving the performance of the algorithm. The techniques for speeding up convergence have fallen into two main categories: heuristic methods and standard numerical optimization methods. We have discussed two heuristic methods: momentum (MOBP) and variable learning rate (VLBP). MOBP is simple to implement, can be used in batch mode or incremental mode and is significantly faster than SDBP. lt does require the selection of the momentum coefficient, but y is limited to the range [0, 1] and the algorithm is not extremely sensitive to this choice. The VLBP algorithm is faster than MOBP but must be used in batch mode. For this reason it requires more storage. VLBP also requires the selection of a total of five parameters. The algorithm is reasonably robust, but the choice of the parameters can affect the convergence speed and is problem dependent. We also presented two standard numerical optimization techniques: conjugate gradient (CGBP) and LevenbergMarquardt (LMBP). CGBP is generally faster than VLBP. It is a batch mode algorithm, which requires a linear search at each iteration, but its storage requirements are not significantly different than VLBP. There are many variations of the conjugate gradient algorithm proposed for neural network applications. We have presented only one. 
The LMBP algorithm is the fastest algorithm that we have tested for training multilayer networks of moderate size, even though it requires a matrix inversion at each iteration. It requires that two parameters be selected, but the algorithm does not appear to be sensitive to this selection. The main


drawback of LMBP is the storage requirement. The JᵀJ matrix, which must be inverted, is n × n, where n is the total number of weights and biases in the network. If the network has more than a few thousand parameters, the LMBP algorithm becomes impractical on current machines. There are many other variations on backpropagation that have not been discussed in this chapter. Some references to other techniques are given in Chapter 19.
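The Levenberg-Marquardt step itself is compact; the cost lies entirely in forming and factoring the n × n matrix. The following Python sketch (using NumPy; the book's own examples are in MATLAB, and the function name here is illustrative) computes one step Δx = -(JᵀJ + μI)⁻¹Jᵀe:

```python
import numpy as np

# Sketch of one Levenberg-Marquardt step. J is the Jacobian of the
# error vector e with respect to the n network parameters x, and mu
# is the LM damping parameter.
def lm_step(x, J, e, mu=0.01):
    n = len(x)
    A = J.T @ J + mu * np.eye(n)        # the n x n matrix that must be inverted
    dx = -np.linalg.solve(A, J.T @ e)   # solve rather than invert explicitly
    return x + dx
```

For a linear least squares problem with μ = 0 this reduces to the Gauss-Newton step, which lands on the optimum in a single iteration. Note that for a network with n parameters, A holds n² entries, which is the storage drawback described above.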


Further Reading
[Barn92] E. Barnard, "Optimization for training neural nets," IEEE Trans. on Neural Networks, vol. 3, no. 2, pp. 232-240, 1992.
A number of optimization algorithms that have promise for neural network training are discussed in this paper.

[Batt92] R. Battiti, "First- and second-order methods for learning: Between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141-166, 1992.
This paper is an excellent survey of the current optimization algorithms that are suitable for neural network training.

[Char92] C. Charalambous, "Conjugate gradient algorithm for efficient training of artificial neural networks," IEE Proceedings, vol. 139, no. 3, pp. 301-310, 1992.
This paper explains how the conjugate gradient algorithm can be used to train multilayer networks. Comparisons are made to other training algorithms.

[Fahl88] S. E. Fahlman, "Faster-learning variations on back-propagation: An empirical study," in D. Touretzky, G. Hinton and T. Sejnowski, eds., Proceedings of the 1988 Connectionist Models Summer School, San Mateo, CA: Morgan Kaufmann, pp. 38-51, 1988.
The QuickProp algorithm, which is described in this paper, is one of the more popular heuristic modifications to backpropagation. It assumes that the error curve can be approximated by a parabola, and that the effect of each weight can be considered independently. QuickProp provides significant speedup over standard backpropagation on many problems.

[HaMe94] M. T. Hagan and M. Menhaj, "Training feedforward networks with the Marquardt algorithm," IEEE Transactions on Neural Networks, vol. 5, no. 6, 1994.
This paper describes the use of the Levenberg-Marquardt algorithm for training multilayer networks and compares the performance of the algorithm with variable learning rate backpropagation and conjugate gradient. The Levenberg-Marquardt algorithm is faster, but requires more storage.


[Jaco88] R. A. Jacobs, "Increased rates of convergence through learning rate adaptation," Neural Networks, vol. 1, no. 4, pp. 295-308, 1988.
This is another early paper discussing the use of variable learning rate backpropagation. The procedure described here is called the delta-bar-delta learning rule, in which each network parameter has its own learning rate that varies at each iteration.

[NgWi90] D. Nguyen and B. Widrow, "Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights," Proceedings of the IJCNN, vol. 3, pp. 21-26, July 1990.
This paper describes a procedure for setting the initial weights and biases for the backpropagation algorithm. It uses the shape of the sigmoid transfer function and the range of the input variables to determine how large the weights should be, and then uses the biases to center the sigmoids in the operating region. The convergence of backpropagation is improved significantly by this procedure.

[RiIr90] A. K. Rigler, J. M. Irvine and T. P. Vogl, "Rescaling of variables in back propagation learning," Neural Networks, vol. 3, no. 5, pp. 561-573, 1990.
This paper notes that the derivative of a sigmoid function is very small on the tails. This means that the elements of the gradient associated with the first few layers will generally be smaller than those associated with the last layer. The terms in the gradient are then scaled to equalize them.

[Scal85] L. E. Scales, Introduction to Non-Linear Optimization, New York: Springer-Verlag, 1985.
Scales has written a very readable text describing the major optimization algorithms. The book emphasizes methods of optimization rather than existence theorems and proofs of convergence. Algorithms are presented with intuitive explanations, along with illustrative figures and examples. Pseudocode is presented for most algorithms.
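The initialization procedure described in [NgWi90] can be sketched in a few lines. The following Python sketch is an approximation of the published recipe, not a faithful reimplementation: the 0.7 scaling constant and the uniform bias spread follow the commonly cited version, and the function name is hypothetical. It assumes s1 sigmoid neurons and r inputs scaled to [-1, 1].

```python
import math
import random

# Rough sketch of Nguyen-Widrow initialization for one hidden layer:
# scale each weight row to a target magnitude determined by the layer
# size, then spread the biases so the sigmoids cover the input region.
def nguyen_widrow(s1, r, rng=None):
    rng = rng or random.Random(0)
    magnitude = 0.7 * s1 ** (1.0 / r)   # target norm for each weight row
    weights = []
    for _ in range(s1):
        row = [rng.uniform(-0.5, 0.5) for _ in range(r)]
        norm = math.sqrt(sum(v * v for v in row))
        weights.append([magnitude * v / norm for v in row])
    # biases spread across [-magnitude, magnitude] to center the sigmoids
    biases = [rng.uniform(-magnitude, magnitude) for _ in range(s1)]
    return weights, biases
```

The key design point is that the weight magnitudes grow slowly with the number of neurons, so each sigmoid stays in its active (non-saturated) region at the start of training.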

[Shan90] D. F. Shanno, "Recent advances in numerical techniques for large-scale optimization," Neural Networks for Control, Miller, Sutton and Werbos, eds., Cambridge, MA: MIT Press, 1990.
This paper discusses some conjugate gradient and quasi-Newton optimization algorithms that could be used for neural network training.

[Toll90] T. Tollenaere, "SuperSAB: Fast adaptive back propagation with good scaling properties," Neural Networks, vol. 3, no. 5, pp. 561-573, 1990.
This paper presents a variable learning rate backpropagation algorithm in which different learning rates are used for each weight.

[VoMa88] T. P. Vogl, J. K. Mangis, A. K. Rigler, W. T. Zink and D. L. Alkon, "Accelerating the convergence of the back-propagation method," Biological Cybernetics, vol. 59, pp. 256-264, Sept. 1988.
This was one of the first papers to introduce several heuristic techniques for accelerating the convergence of backpropagation. It included batching, momentum and variable learning rate.

E12.1 We want to train the network shown in Figure E12.1 on the training set

{p₁ = [-2], t₁ = [0]},  {p₂ = [2], t₂ = [1]},

where each pair is equally likely to occur. Write a MATLAB M-file to create a contour plot for the mean squared error performance index.
Figure E12.1 Network for Exercise E12.1 (single input, log-sigmoid layer)

E12.2 Demonstrate the effect of batching by computing the direction of the initial step for SDBP with and without batching for the problem described in Exercise E12.1, starting from the initial guess w(0) = 0, b(0) = 0.5.

E12.3 Recall the quadratic function used in Problem P9.1:

F(x) = (1/2) xᵀ [ 10  -6 ; -6  10 ] x + [ 4  4 ] x.

We want to use the steepest descent algorithm with momentum to minimize this function.
i. Suppose that the learning rate is α = 0.2. Find a value for the momentum coefficient γ for which the algorithm will be stable. Use the ideas presented in Problem P12.2.
ii. Suppose that the learning rate is α = 20. Find a value for the momentum coefficient γ for which the algorithm will be stable.


iii. Write a MATLAB program to plot the trajectories of the algorithm for the α and γ values of both part (i) and part (ii) on the contour plot of F(x), starting from the initial guess x₀ = [-1  -2.5]ᵀ.

E12.4 For the function of Exercise E12.3, perform three iterations of the variable learning rate algorithm, with initial guess x₀ = [-1  -2.5]ᵀ. Plot the algorithm trajectory on a contour plot of F(x). Use the algorithm parameters

α = 0.4, γ = 0.1, η = 1.5, ρ = 0.5, ζ = 5%.
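For reference, the acceptance rule that these VLBP parameters control can be sketched as follows. This is an illustrative Python reading of the rule from this chapter, not a solution to the exercise; the parameter names match the exercise (η multiplies the rate after a successful step, ρ shrinks it after a rejected step, and ζ is the allowed fractional increase in the error):

```python
# Sketch of the VLBP learning rate acceptance rule. Returns the new
# learning rate and whether the tentative step is accepted.
def vlbp_update(err_new, err_old, alpha, eta=1.5, rho=0.5, zeta=0.05):
    if err_new > err_old * (1.0 + zeta):
        return rho * alpha, False      # reject the step, reduce the rate
    if err_new < err_old:
        return eta * alpha, True       # accept the step, increase the rate
    return alpha, True                 # small increase in error: accept, keep rate
```

A full VLBP iteration would also zero the momentum coefficient on a rejected step and restore it once a step is accepted.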

E12.5 For the function of Exercise E12.3, perform one iteration of the conjugate gradient algorithm, with initial guess x₀ = [-1  -2.5]ᵀ. For the linear minimization use interval location by function evaluation and interval reduction by the Golden Section search. Plot the path of the search on a contour plot of F(x).

E12.6 We want to use the network of Figure E12.2 to approximate the function

g(p) = 1 + sin(πp/4) for -2 ≤ p ≤ 2.

The initial network parameters are chosen to be

w¹(0) = [-0.27  -0.41]ᵀ, b¹(0) = [-0.48  -0.13]ᵀ, w²(0) = [0.09  -0.17], b²(0) = [0.48].

To create the training set we sample the function g(p) at the points p = 1 and p = 0. Find the Jacobian matrix for the first step of the LMBP algorithm. (Some of the information you will need has been computed in the example starting on page 11-14.)


Input, Log-Sigmoid Layer, Linear Layer:

a¹ = logsig(W¹p + b¹),  a² = purelin(W²a¹ + b²)
Figure E12.2 Network for Exercise E12.6

E12.7 Show that for a linear network the LMBP algorithm will converge to an optimum solution in one iteration if μ = 0.

E12.8 In Exercise E11.11 you wrote a MATLAB program to implement the SDBP algorithm for the 1-2-1 network shown in Figure E12.2, and trained the network to approximate the function

g(p) = 1 + sin(πp/4) for -2 ≤ p ≤ 2.

Repeat this exercise, modifying your program to use the training procedures discussed in this chapter: batch mode SDBP, MOBP, VLBP, CGBP and LMBP. Compare the convergence results of the various methods.
