Conjugate Gradient Method: Com S 477/577 Nov 6, 2007



Introduction

Recall that in steepest descent for nonlinear optimization, the steps are along directions that undo some of the progress of the others. The basic idea of the conjugate gradient method is to move in non-interfering directions.

Suppose we have just performed a line minimization along the direction $u$. Then the gradient $\nabla f$ at the current point is perpendicular to $u$, because otherwise we would have been able to move further along $u$. Next, we should move along some direction $v$. In steepest descent we let $v = -\nabla f$. In the conjugate gradient method we perturb $-\nabla f$ by adding to it some direction to become $v$. We want to choose $v$ in such a way that it does not undo our minimization along $u$. In other words, we want $\nabla f$ to be perpendicular to $u$ before and after we move along $v$. At least locally, we want the change in $\nabla f$ to be perpendicular to $u$. Now observe that a small change $\delta x$ in $x$ will produce a small change in $\nabla f$ given by

$$\delta(\nabla f) \approx H_f\, \delta x.$$

Our idea of moving along non-interfering directions leads to the condition

$$u^T\, \delta(\nabla f) = 0,$$

and the next move should be along the direction $v$ such that

$$u^T H_f\, v = 0. \tag{1}$$

Even though $v$ is not orthogonal to $u$, it is $H_f$-orthogonal to $u$.

Of course, we must worry about a slight technicality. The connection between $\delta x$ and $\delta(\nabla f)$ in terms of the Hessian $H_f$ is a differential relationship. Here we use it for finite motions, to the extent that Taylor's approximation of order 2 is valid. Suppose we expand $f$ around a point $y$:

$$f(x + y) \approx f(y) + \nabla f(y)^T x + \frac{1}{2}\, x^T H_f(y)\, x.$$

Thus $f$ locally looks like a quadratic. If we focus on quadratics, then the Hessian $H_f$ does not vary as we move along directions $u$ and $v$. Thus the condition (1) makes sense.

With this reasoning as background, one develops the conjugate gradient method for quadratic functions formed from symmetric positive definite matrices. For such quadratic functions, the conjugate gradient method converges to the unique global minimum in at most $n$ steps, by moving along successive non-interfering directions. For general functions, the conjugate gradient method repeatedly executes packages of $n$ steps. Once near a local minimum, the algorithm converges quadratically when measured over each package of $n$ steps.
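The difference between the two choices of $v$ can be seen on a small quadratic. The following is a minimal sketch in plain Python; the matrix `Q`, the vector `b`, the starting point, and the hand-picked direction `v_cg` are illustrative assumptions, not values from these notes. It minimizes along $u$ first, then moves either along $-\nabla f$ or along a direction satisfying condition (1), and reports $u^T \nabla f$ afterwards.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def gradient(Q, b, x):
    """Gradient of f(x) = 0.5 x^T Q x + b^T x, whose Hessian is Q."""
    return [qi + bi for qi, bi in zip(matvec(Q, x), b)]

def line_minimize(Q, b, x, v):
    """Exact minimizer of the quadratic f along the line x + t v."""
    t = -dot(gradient(Q, b, x), v) / dot(v, matvec(Q, v))
    return [xi + t * vi for xi, vi in zip(x, v)]

# Illustrative data (assumed for this sketch, not taken from the notes).
Q = [[4.0, 1.0], [1.0, 3.0]]
b = [-1.0, -2.0]
u = [1.0, 0.0]

x1 = line_minimize(Q, b, [2.0, -1.0], u)   # minimize along u first
g1 = gradient(Q, b, x1)                    # now u^T g1 = 0

v_sd = [-gi for gi in g1]                  # steepest descent: v = -grad f
v_cg = [-0.25, 1.0]                        # hand-picked so that u^T Q v = 0

# Component of the new gradient along u after each kind of move.
u_dot_g_sd = dot(u, gradient(Q, b, line_minimize(Q, b, x1, v_sd)))
u_dot_g_cg = dot(u, gradient(Q, b, line_minimize(Q, b, x1, v_cg)))
```

In this instance the steepest descent move leaves $u^T \nabla f = 1.5$, so the earlier minimization along $u$ is partly undone, while the move along the $Q$-orthogonal direction leaves $u^T \nabla f = 0$ up to rounding.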

Conjugate Direction

Given a symmetric matrix $Q$, two vectors $d_1$ and $d_2$ are said to be $Q$-orthogonal, or conjugate with respect to $Q$, if $d_1^T Q d_2 = 0$. A finite set of vectors $d_0, d_1, \ldots, d_k$ is said to be a $Q$-orthogonal set if $d_i^T Q d_j = 0$ for all $i \neq j$.

Proposition 1 If $Q$ is symmetric positive definite and the nonzero vectors $d_0, d_1, \ldots, d_k$ are $Q$-orthogonal to each other, then they are linearly independent.

Proof Suppose there exist constants $\alpha_i$, $i = 0, 1, \ldots, k$, such that

$$\alpha_0 d_0 + \cdots + \alpha_k d_k = 0.$$

Multiplying by $Q$ and taking the scalar product with $d_i$ yields

$$\alpha_i\, d_i^T Q d_i = 0 \quad \text{for } i = 0, 1, \ldots, k.$$

Since $d_i^T Q d_i > 0$ by the positive definiteness of $Q$, we have $\alpha_i = 0$ for $i = 0, \ldots, k$.

Let us investigate just why the notion of $Q$-orthogonality is useful in the solution of the following problem:

$$\min_x \; \frac{1}{2}\, x^T Q x + b^T x, \tag{2}$$

where $Q$ is symmetric positive definite. The unique solution to this problem is also the unique solution to the equation

$$Qx + b = 0. \tag{3}$$

Suppose that $d_0, \ldots, d_{n-1}$ are $n$ nonzero $Q$-orthogonal vectors. By the previous proposition, these vectors are independent. Therefore they form a $Q$-orthogonal basis for $\mathbb{R}^n$. Let $x^*$ be the unique solution to (2) or (3). We can write

$$x^* = \alpha_0 d_0 + \cdots + \alpha_{n-1} d_{n-1}$$

for some real numbers $\alpha_0, \ldots, \alpha_{n-1}$. Plugging the above into (3) yields

$$Q(\alpha_0 d_0 + \cdots + \alpha_{n-1} d_{n-1}) + b = 0,$$

and taking the inner product with $d_i$ gives

$$d_i^T Q(\alpha_0 d_0 + \cdots + \alpha_{n-1} d_{n-1}) + d_i^T b = 0.$$

Due to the $Q$-orthogonality of the $d_i$'s, we can solve for these coefficients:

$$\alpha_i = -\frac{d_i^T b}{d_i^T Q d_i}.$$

Thus we obtain the explicit formula

$$x^* = -\sum_{i=0}^{n-1} \frac{d_i^T b}{d_i^T Q d_i}\, d_i.$$

Notice two important facts:

1. By choosing $d_0, \ldots, d_{n-1}$ to be $Q$-orthogonal we can determine the coefficients $\alpha_0, \ldots, \alpha_{n-1}$ easily, using inner products.
2. The approach is possible for any positive-definite matrix. In particular, we could simply have chosen the $d_i$'s to be orthogonal (i.e., $I$-orthogonal). Then $x^* = \sum_{i=0}^{n-1} (d_i^T x^* / d_i^T d_i)\, d_i$. However, by choosing the $d_i$'s to be $Q$-orthogonal we can determine the coefficients $\alpha_i$ in terms of the known quantity $b$, not the unknown quantity $x^*$.

How does this generate an algorithm? One view is purely algebraic, namely, we compute $\alpha_0, \alpha_1, \ldots, \alpha_{n-1}$. Another view is to think of these computations as an $n$-step search. We start the search at the origin. On the $i$th iteration we move in the direction $d_i$ by $\alpha_i$. After $n$ iterations, we have found the unique minimum $x^*$, as we will see shortly.

But two important issues remain:

1. How do we construct the $Q$-orthogonal vectors $d_0, \ldots, d_{n-1}$?
2. How do we deal with the reality that the matrix $Q = H_f$ is often unknown?
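The explicit formula for $x^*$ can be tried on a small instance. The sketch below is plain Python with illustrative data (`Q` and `b` are assumptions). To obtain $Q$-orthogonal directions it uses Gram-Schmidt in the $Q$ inner product applied to the coordinate axes; that construction is an assumption made only for this example and is not the one developed in these notes.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def q_orthogonalize(vectors, Q):
    """Gram-Schmidt in the Q inner product (assumed construction for this sketch)."""
    ds = []
    for v in vectors:
        d = list(v)
        for p in ds:
            Qp = matvec(Q, p)
            coeff = dot(v, Qp) / dot(p, Qp)
            d = [di - coeff * pi for di, pi in zip(d, p)]
        ds.append(d)
    return ds

def solve_by_conjugate_directions(Q, b, ds):
    """Evaluate x* = -sum_i (d_i^T b / d_i^T Q d_i) d_i, starting from the origin."""
    x = [0.0] * len(b)
    for d in ds:
        alpha = -dot(d, b) / dot(d, matvec(Q, d))
        x = [xi + alpha * di for xi, di in zip(x, d)]
    return x

# Illustrative symmetric positive definite Q and vector b (assumed).
Q = [[4.0, 1.0], [1.0, 3.0]]
b = [-1.0, -2.0]

ds = q_orthogonalize([[1.0, 0.0], [0.0, 1.0]], Q)
x_star = solve_by_conjugate_directions(Q, b, ds)

# Residual of equation (3); it should vanish up to rounding.
residual = [qi + bi for qi, bi in zip(matvec(Q, x_star), b)]
```

Only inner products with the known vector $b$ are needed; the unknown $x^*$ never appears on the right-hand side, which is the point of fact 2 above.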

Properties of Descent

Let $Q$ be a symmetric positive definite matrix. We define $B_k$ as the subspace of $\mathbb{R}^n$ spanned by a set of $Q$-orthogonal vectors $d_0, d_1, \ldots, d_{k-1}$; or for short, $B_k = \mathrm{span}\{d_0, d_1, \ldots, d_{k-1}\}$.

Theorem 2 (Expanding Subspace) Let $\{d_0, \ldots, d_{n-1}\}$ be a set of nonzero $Q$-orthogonal vectors in $\mathbb{R}^n$. For any $x_0 \in \mathbb{R}^n$, consider the sequence $\{x_k\}$ generated by the rule

$$x_{k+1} = x_k + \alpha_k d_k,$$

where, writing $g_k = Q x_k + b$,

$$\alpha_k = -\frac{g_k^T d_k}{d_k^T Q d_k}.$$

The following statements hold:

(i) The sequence $\{x_k\}$ converges to the unique solution $x^*$ of $Qx + b = 0$ after $n$ steps. In other words, $x_n = x^*$ minimizes the function $f(x) = \frac{1}{2} x^T Q x + b^T x$.

(ii) $x_{k+1}$ minimizes the same function $f(x)$ on the line $x = x_k + \alpha d_k$, $-\infty < \alpha < \infty$, as well as on the linear variety $x_0 + B_{k+1}$.

Proof To prove (i), we make use of the linear independence of the $d_j$'s. Notice that

$$x^* - x_0 = \alpha_0 d_0 + \cdots + \alpha_{n-1} d_{n-1}$$

for some $\alpha_0, \ldots, \alpha_{n-1}$. We multiply both sides of the equation by $Q$ and take the inner product with $d_k$, yielding

$$\alpha_k = \frac{d_k^T Q (x^* - x_0)}{d_k^T Q d_k}. \tag{4}$$

Now we use induction to show that $\alpha_k$ defined in (4) equals $-g_k^T d_k / d_k^T Q d_k$. Suppose this is true for $\alpha_0, \ldots, \alpha_{k-1}$. Following the iterative steps from $x_0$ up to $x_k$ we have

$$x_k - x_0 = \alpha_0 d_0 + \cdots + \alpha_{k-1} d_{k-1}.$$

By the $Q$-orthogonality of the $d_j$'s it follows that

$$d_k^T Q (x_k - x_0) = 0.$$

Substituting the above into (4) we obtain

$$
\begin{aligned}
\alpha_k &= \frac{d_k^T Q (x^* - x_k + x_k - x_0)}{d_k^T Q d_k} \\
         &= \frac{d_k^T Q (x^* - x_k)}{d_k^T Q d_k} \\
         &= \frac{d_k^T (Q x^* - Q x_k)}{d_k^T Q d_k} \\
         &= \frac{d_k^T (-b - Q x_k)}{d_k^T Q d_k} \\
         &= -\frac{g_k^T d_k}{d_k^T Q d_k}.
\end{aligned}
$$

To prove (ii), we show that $x_{k+1}$ minimizes $f$ over the linear variety $x_0 + B_{k+1}$, which contains the line $x = x_k + \alpha d_k$. Since the quadratic function $f$ is strictly convex, a local minimum is also a global one. So the conclusion will hold if it can be shown that the gradient $g_{k+1}$ is orthogonal to $B_{k+1}$, that is, if the gradient is orthogonal to $d_0, d_1, \ldots, d_k$. (Otherwise, we could always move along some direction in $B_{k+1}$ to decrease $f$.)

We prove this by induction. The hypothesis is true for $k = 0$ since $B_0 = \{0\}$ is the trivial subspace. Assume that $g_k \perp B_k$. We have

$$g_{k+1} = Q x_{k+1} + b = Q(x_k + \alpha_k d_k) + b = Q x_k + b + \alpha_k Q d_k = g_k + \alpha_k Q d_k,$$

and hence by definition of $\alpha_k$

$$
\begin{aligned}
d_k^T g_{k+1} &= d_k^T g_k + \alpha_k\, d_k^T Q d_k \\
              &= d_k^T g_k - \frac{g_k^T d_k}{d_k^T Q d_k}\, d_k^T Q d_k \\
              &= d_k^T g_k - g_k^T d_k \\
              &= 0.
\end{aligned}
$$

Also it holds that

$$d_i^T g_{k+1} = d_i^T g_k + \alpha_k\, d_i^T Q d_k \quad \text{for } i < k.$$

The first term on the right-hand side of the above equation vanishes due to the induction hypothesis, while the second term vanishes by the $Q$-orthogonality of the $d_i$'s. Thus $g_{k+1} \perp B_{k+1}$.

Corollary 3 The gradients $g_k$, $k = 0, 1, \ldots, n$, satisfy $g_k \perp B_k$.

The Expanding Subspace Theorem tells us that the conjugate gradient algorithm really is a generalization of steepest descent. Each step of adding $\alpha_k d_k$ to the previous estimate is the same as doing a line minimization along the direction of $d_k$. Furthermore, the offset $\alpha_k d_k$ does not undo previous progress, that is, the minimization is in fact a minimization over $x_0 + B_{k+1}$.
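[Figure: the linear variety $x_0 + B_{k+1}$, showing the points $x_{k-1}$, $x_k$, $x_{k+1}$, the directions $d_{k-1}$, $d_k$, and the gradient $g_{k+1}$.]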

So the $B_k$'s form a sequence of subspaces with $B_k \subset B_{k+1}$. Because $x_k$ minimizes $f$ over $x_0 + B_k$, it is clear that $x_n$ minimizes $f$ over the entire space $\mathbb{R}^n = B_n$.
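The Expanding Subspace Theorem can be checked numerically on a small case. The following sketch is plain Python with assumed data: `Q`, `b`, the starting point, and the two hand-picked $Q$-orthogonal directions are illustrative, not from these notes. It runs the rule $x_{k+1} = x_k + \alpha_k d_k$ from a nonzero $x_0$ and records the largest $|d_i^T g_{k+1}|$ over $i \le k$, which the theorem says should be zero.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_direction_descent(Q, b, x0, directions):
    """Apply x_{k+1} = x_k + alpha_k d_k and track max |d_i^T g_{k+1}| for i <= k."""
    x = list(x0)
    worst = 0.0
    for k, d in enumerate(directions):
        g = [qi + bi for qi, bi in zip(matvec(Q, x), b)]      # g_k = Q x_k + b
        alpha = -dot(g, d) / dot(d, matvec(Q, d))             # step length of Theorem 2
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_next = [qi + bi for qi, bi in zip(matvec(Q, x), b)]
        for p in directions[: k + 1]:                          # g_{k+1} against B_{k+1}
            worst = max(worst, abs(dot(p, g_next)))
    return x, worst

# Illustrative data (assumed); the directions satisfy d0^T Q d1 = 0.
Q = [[4.0, 1.0], [1.0, 3.0]]
b = [-1.0, -2.0]
directions = [[1.0, 0.0], [-0.25, 1.0]]

x_n, worst = conjugate_direction_descent(Q, b, [2.0, -1.0], directions)
```

With these values `x_n` agrees with the solution of $Qx + b = 0$, namely $(1/11, 7/11)$, and `worst` stays at rounding level, consistent with Corollary 3.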

Conjugate Gradient Algorithm

The conjugate gradient algorithm selects the successive direction vectors as a conjugate version of the successive gradients obtained as the method progresses. Thus, the directions are not specified beforehand, but rather are determined sequentially at each step of the iteration. At step $k$ one evaluates the current negative gradient vector and adds to it a linear combination of the previous direction vectors to obtain a new conjugate direction vector along which to move.

There are three primary advantages to this method of direction selection. First, unless the solution is attained in fewer than $n$ steps, the gradient is always nonzero and linearly independent of all previous direction vectors. Indeed, as the corollary states, the gradient $g_k$ is orthogonal to the subspace $B_k$ generated by $d_0, d_1, \ldots, d_{k-1}$. If the solution is reached before $n$ steps are taken, the gradient vanishes and the process terminates.

Second, a more important advantage of the conjugate gradient method is the especially simple formula that is used to determine the new direction vector. This simplicity makes the method only slightly more complicated than steepest descent.

Third, because the directions are based on the gradients, the process makes good uniform progress toward the solution at every step. This is in contrast to the situation for arbitrary sequences of conjugate directions, in which progress may be slight until the final few steps. Although for the pure quadratic problem uniform progress is of no great importance, it is important for generalizations to nonquadratic problems.

Conjugate Gradient Algorithm

1. $g_0 \leftarrow Q x_0 + b$
2. $d_0 \leftarrow -g_0$
3. for $k = 0, \ldots, n-1$ do
   a) $\alpha_k \leftarrow -\dfrac{g_k^T d_k}{d_k^T Q d_k}$
   b) $x_{k+1} \leftarrow x_k + \alpha_k d_k$
   c) $g_{k+1} \leftarrow Q x_{k+1} + b$
   d) $\beta_k \leftarrow \dfrac{g_{k+1}^T Q d_k}{d_k^T Q d_k}$
   e) $d_{k+1} \leftarrow -g_{k+1} + \beta_k d_k$
4. return $x_n$

Step 3b) when $k = 0$ is a steepest descent step. Each subsequent step moves in a direction that modifies the opposite of the current gradient by a factor of the previous direction. Steps 3a) through 3e) give us the $Q$-orthogonality of the descent vectors $d_0, \ldots, d_{n-1}$.

Theorem 4 (Conjugate Gradient Theorem) In the conjugate gradient algorithm, we have that

a) $\mathrm{span}\{g_0, \ldots, g_k\} = \mathrm{span}\{g_0, Q g_0, \ldots, Q^k g_0\}$
b) $\mathrm{span}\{d_0, \ldots, d_k\} = \mathrm{span}\{g_0, Q g_0, \ldots, Q^k g_0\}$
c) $d_k^T Q d_i = 0$ for all $i < k$
d) $\alpha_k = \dfrac{g_k^T g_k}{d_k^T Q d_k}$
e) $\beta_k = \dfrac{g_{k+1}^T g_{k+1}}{g_k^T g_k}$

For proof of the theorem we refer to [2, pp. 245-246]. Part c) of the above theorem states that the $d_i$'s are $Q$-orthogonal to each other. Part e) is very important, because it provides us a way to compute $\beta_k$ without knowing $Q$.
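The algorithm for the quadratic case can be sketched in a few lines. This is a minimal plain-Python version under stated assumptions: it uses the equivalent formulas d) and e) of Theorem 4 for $\alpha_k$ and $\beta_k$, updates the gradient through $g_{k+1} = g_k + \alpha_k Q d_k$ (the identity used in the proof of Theorem 2), and adds a small early-exit threshold that is not part of the listing above. The test matrix `Q` and vector `b` are illustrative.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_gradient(Q, b, x0):
    """Minimize 0.5 x^T Q x + b^T x for symmetric positive definite Q."""
    x = list(x0)
    g = [qi + bi for qi, bi in zip(matvec(Q, x), b)]   # g_0 = Q x_0 + b
    d = [-gi for gi in g]                              # d_0 = -g_0
    for _ in range(len(b)):                            # at most n steps
        gg = dot(g, g)
        if gg < 1e-30:                                 # assumed guard: gradient has vanished
            break
        Qd = matvec(Q, d)
        alpha = gg / dot(d, Qd)                        # Theorem 4 d)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g = [gi + alpha * qdi for gi, qdi in zip(g, Qd)]   # g_{k+1} = g_k + alpha_k Q d_k
        beta = dot(g, g) / gg                          # Theorem 4 e)
        d = [-gi + beta * di for gi, di in zip(g, d)]  # d_{k+1} = -g_{k+1} + beta_k d_k
    return x

# Illustrative 3x3 symmetric positive definite system (assumed).
Q = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [-1.0, -2.0, -3.0]

x_cg = conjugate_gradient(Q, b, [0.0, 0.0, 0.0])
residual = [qi + bi for qi, bi in zip(matvec(Q, x_cg), b)]
```

Each iteration needs one product with $Q$ and a few inner products, which is why the method is only slightly more expensive per step than steepest descent.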

Extension to Nonquadratic Problems

How do we compute $\alpha_k$ without knowing $Q$? The Expanding Subspace Theorem already gave us the answer: a line search. This agrees with the formula in the quadratic case.

We can generalize the conjugate gradient algorithm to devise a numerical routine to minimize an arbitrary function $f$. Here the Hessian of $f$ plays the role of $Q$. The algorithm executes groups of $n$ search steps. Each step builds a coordinate $\alpha_i d_i$ in a search for the minimum $x^*$. After $n$ steps, the algorithm resets, using its current $x$ location as a new origin from which to start another $n$-step search.

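Fletcher-Reeves Algorithm

1. start at some $x_0$
2. $d_0 \leftarrow -\nabla f(x_0)$
3. for $k = 0, 1, \ldots, n-1$ do
   a) obtain $\alpha_k$ that minimizes $g(\alpha) = f(x_k + \alpha d_k)$
   b) $x_{k+1} \leftarrow x_k + \alpha_k d_k$
   c) $\beta_k \leftarrow \dfrac{\|\nabla f(x_{k+1})\|^2}{\|\nabla f(x_k)\|^2}$
   d) $d_{k+1} \leftarrow -\nabla f(x_{k+1}) + \beta_k d_k$
4. $x_0 \leftarrow x_n$
5. go back to step 2 until satisfied with the results.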

To determine $\beta_k$ the algorithm employs part e) of Theorem 4. Step 2 ensures that there is at least one descent direction in every $n$ iterations. Steps 3a) and 3b) ensure that no step increases $f$.

Global convergence of the line search methods is established by noting that a pure steepest descent step is taken every $n$ steps and serves as a spacer step. Since the other steps do not increase the objective, and in fact hopefully decrease it, global convergence is assured. The restarting aspect of the algorithm is important for global convergence analysis, since in general one cannot guarantee that the directions $d_k$ generated by the method are descent directions.

The local convergence properties of nonquadratic extensions of the conjugate gradient method can be inferred from the quadratic analysis. Assuming that at the solution $x^*$ the Hessian $H_f$ is positive definite, we expect the asymptotic convergence rate per step to be at least as good as steepest descent, since this is true in the quadratic case. In addition to this bound on the single-step rate, we expect that the method is of order two with respect to each complete cycle of $n$ steps. In other words, since one complete cycle solves a quadratic problem exactly, just as Newton's method does in one step, we expect that for general nonquadratic problems there will hold

$$\|x_{k+n} - x^*\| \le c\, \|x_k - x^*\|^2$$

for some $c$ and $k = 0, n, 2n, \ldots$. This can indeed be proved, and of course underlies the original motivation for the method.
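The Fletcher-Reeves steps can be sketched as follows in plain Python. Several pieces are assumptions made for illustration: the line search helper `line_minimize` (bracket doubling followed by golden-section search, which presumes $f$ is bounded below and unimodal along the search ray), the stopping rule (`cycles` and `grad_tol`), and the convex test function with minimum at $(1, -0.5)$. The notes themselves only ask for some $\alpha_k$ that minimizes $f(x_k + \alpha d_k)$.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def line_minimize(phi, tol=1e-10):
    """Approximate minimizer of phi(alpha) over alpha >= 0 (assumed helper)."""
    hi = 1.0
    for _ in range(60):                      # grow the bracket while phi keeps decreasing
        if phi(2.0 * hi) >= phi(hi):
            break
        hi *= 2.0
    hi *= 2.0
    lo = 0.0
    ratio = (5 ** 0.5 - 1) / 2               # golden-section ratio
    while hi - lo > tol:
        a = hi - ratio * (hi - lo)
        c = lo + ratio * (hi - lo)
        if phi(a) < phi(c):
            hi = c
        else:
            lo = a
    alpha = 0.5 * (lo + hi)
    return alpha if phi(alpha) < phi(0.0) else 0.0   # never accept an increase in f

def fletcher_reeves(f, grad, x0, cycles=50, grad_tol=1e-8):
    """Fletcher-Reeves with a restart every n steps; stopping rule is an assumption."""
    x = list(x0)
    n = len(x)
    for _cycle in range(cycles):
        g = grad(x)
        d = [-gi for gi in g]                            # step 2: steepest descent restart
        for _k in range(n):                              # step 3
            gg = dot(g, g)
            if gg < grad_tol ** 2:
                return x
            phi = lambda a, x=x, d=d: f([xi + a * di for xi, di in zip(x, d)])
            alpha = line_minimize(phi)                   # step 3a
            x = [xi + alpha * di for xi, di in zip(x, d)]        # step 3b
            g_new = grad(x)
            beta = dot(g_new, g_new) / gg                # step 3c
            d = [-gn + beta * di for gn, di in zip(g_new, d)]    # step 3d
            g = g_new
    return x

# Illustrative convex, nonquadratic test function (assumed).
def f(p):
    x, y = p
    return (x - 1) ** 2 + 2 * (y + 0.5) ** 2 + 0.1 * (x - 1) ** 4

def grad_f(p):
    x, y = p
    return [2 * (x - 1) + 0.4 * (x - 1) ** 3, 4 * (y + 0.5)]

x_min = fletcher_reeves(f, grad_f, [-1.0, 1.0])
```

The final comparison in `line_minimize` mirrors the remark that steps 3a) and 3b) never increase $f$: if the search finds no improvement along $d_k$, the step length falls back to zero and the next restart supplies a descent direction.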

Conclusion

Recall that in finding a minimum of a function $f$ of $n$ variables, we may wish to consider the set of zeros of the gradient $\nabla f$. In principle, we could apply Newton's method to $\nabla f$, resulting in the following iteration formula:

$$x^{(m+1)} = x^{(m)} - H_f\bigl(x^{(m)}\bigr)^{-1}\, \nabla f\bigl(x^{(m)}\bigr).$$

Suppose $f$ is a quadratic function of the form

$$f(x) = c + b^T x + \frac{1}{2}\, x^T A x,$$

where $A$ is symmetric positive definite. Then $\nabla f = b + Ax$ and $H_f = A$, and the global minimum of $f$ satisfies $Ax = -b$. In this case, Newton's method converges in a single step. But for general $f$, the Hessian $H_f$ often is unknown. To remedy this, there exist methods called quasi-Newton methods that build $(H_f)^{-1}$ iteratively as they move.

Conjugate gradient is an intermediate between steepest descent and Newton's method. It tries to achieve the quadratic convergence of Newton's method without incurring the cost of computing $H_f$. At the same time, conjugate gradient will execute at least one gradient descent step per $n$ steps. It has proved to be extremely effective in dealing with general objective functions and is considered among the best general-purpose methods presently available.
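The single-step behavior of Newton's method on a quadratic can be verified by direct substitution. In the short derivation below the starting point $x^{(0)}$ is arbitrary, and the only facts used are $\nabla f = b + Ax$ and $H_f = A$:

```latex
\begin{aligned}
x^{(1)} &= x^{(0)} - H_f\bigl(x^{(0)}\bigr)^{-1}\, \nabla f\bigl(x^{(0)}\bigr) \\
        &= x^{(0)} - A^{-1}\bigl(b + A x^{(0)}\bigr) \\
        &= -A^{-1} b .
\end{aligned}
```

Hence $A x^{(1)} = -b$ and $\nabla f(x^{(1)}) = 0$ regardless of $x^{(0)}$, which is the same point that the conjugate gradient algorithm reaches after one complete cycle of $n$ steps.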

References
[1] M. Erdmann. Lecture notes for 16-811 Mathematical Fundamentals for Robotics. The Robotics Institute, Carnegie Mellon University, 1998.
[2] D. G. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, 2nd edition, 1984.
