Volume 3, 1994
Optimization Techniques on
Riemannian Manifolds
Steven T. Smith
Harvard University
Division of Applied Sciences
Cambridge, Massachusetts 02138
Abstract. The techniques and analysis presented in this paper provide new meth-
ods to solve optimization problems posed on Riemannian manifolds. A new point of
view is offered for the solution of constrained optimization problems. Some classical
optimization techniques on Euclidean space are generalized to Riemannian manifolds.
Several algorithms are presented and their convergence properties are analyzed em-
ploying the Riemannian structure of the manifold. Specifically, two apparently new
algorithms, which can be thought of as Newton’s method and the conjugate gradient
method on Riemannian manifolds, are presented and shown to possess, respectively,
quadratic and superlinear convergence. Examples of each method on certain Rieman-
nian manifolds are given with the results of numerical experiments. Rayleigh’s quotient
defined on the sphere is one example. It is shown that Newton’s method applied to
this function converges cubically, and that the Rayleigh quotient iteration is an effi-
cient approximation of Newton’s method. The Riemannian version of the conjugate
gradient method applied to this function gives a new algorithm for finding the eigen-
vectors corresponding to the extreme eigenvalues of a symmetric matrix. Another
example arises from extremizing the function tr Θ^T QΘN on the special orthogonal
group. In a similar example, it is shown that Newton’s method applied to the sum
of the squares of the off-diagonal entries of a symmetric matrix converges cubically.
Keywords. Optimization, constrained optimization, Riemannian manifolds, Lie
groups, homogeneous spaces, steepest descent, Newton’s method, conjugate gradient
method, eigenvalue problem, Rayleigh’s quotient, Rayleigh quotient iteration, Jacobi
methods, numerical methods.
1 Introduction
The preponderance of optimization techniques address problems posed on Eu-
clidean spaces. Indeed, several fundamental algorithms have arisen from the de-
sire to compute the minimum of quadratic forms on Euclidean space. However,
many optimization problems are posed on non-Euclidean spaces. For example,
finding the largest eigenvalue of a symmetric matrix may be posed as the max-
imization of Rayleigh’s quotient defined on the sphere. Optimization problems
subject to nonlinear differentiable equality constraints on Euclidean space also
lie within this category. Many optimization problems share with these examples
the structure of a differentiable manifold endowed with a Riemannian metric.
This is the subject of this paper: the extremization of functions defined on
Riemannian manifolds.
The minimization of functions on a Riemannian manifold is, at least locally,
equivalent to the smoothly constrained optimization problem on a Euclidean
space, because every C∞ Riemannian manifold can be isometrically imbedded
in some Euclidean space [46, Vol. V]. However, the dimension of the Euclidean
space may be larger than the dimension of the manifold; practical and aes-
thetic considerations suggest that one try to exploit the intrinsic structure of
the manifold. Elements of this spirit may be found throughout the field of
numerical methods, such as the emphasis on unitary (norm preserving) trans-
formations in numerical linear algebra [22], or the use of feasible direction meth-
ods [18, 21, 38].
An intrinsic approach leads one from the extrinsic idea of vector addition to
the exponential map and parallel translation, from minimization along lines to
minimization along geodesics, and from partial differentiation to covariant dif-
ferentiation. The computation of geodesics, parallel translation, and covariant
derivatives can be quite expensive. For an n-dimensional manifold, the compu-
tation of geodesics and parallel translation requires the solution of a system of
2n nonlinear and n linear ordinary differential equations. Nevertheless, many
optimization problems are posed on manifolds that have an underlying algebraic
structure that may be exploited to greatly reduce the complexity of these compu-
tations. For example, on a real compact semisimple Lie group endowed with its
natural Riemannian metric, geodesics and parallel translation may be computed
via matrix exponentiation [24]. Several algorithms are available to perform this
computation [22, 32]. This algebraic structure may be found in the problems
posed by Brockett [8, 9, 10], Bloch et al. [3, 4], Smith [45], Faybusovich [17],
Lagarias [30], Chu et al. [13, 14], Perkins et al. [35], and Helmke [25]. This
approach is also applicable if the manifold can be identified with a symmetric
space or, excepting parallel translation, a reductive homogeneous space [29, 33].
Perhaps the simplest nontrivial example is the sphere, where geodesics and par-
allel translation can be computed at low cost with trigonometric functions and
vector addition. Furthermore, Brown and Bartholomew-Biggs [11] show that
in some cases function minimization by following the solution of a system of
ordinary differential equations can be implemented such that it is competitive
with conventional techniques.
The outline of the paper is as follows. In Section 2, the optimization prob-
lem is posed and conventions to be held throughout the paper are established.
The method of steepest descent on a Riemannian manifold is described in Sec-
tion 3. To fix ideas, a proof of linear convergence is given. The examples of
Rayleigh’s quotient on the sphere and the function tr Θ^T QΘN on the special
orthogonal group are presented. In Section 4, Newton’s method on a Rieman-
nian manifold is derived. As in Euclidean space, this algorithm may be used to
compute the extrema of differentiable functions. It is proved that this method
converges quadratically. The example of Rayleigh’s quotient is continued, and it
is shown that Newton’s method applied to this function converges cubically, and
is approximated by the Rayleigh quotient iteration. The example considering
tr Θ^T QΘN is continued. In a related example, it is shown that Newton’s method
applied to the sum of the squares of the off-diagonal elements of a symmetric
matrix converges cubically. This provides an example of a cubically convergent
Jacobi-like method. The conjugate gradient method is presented in Section 5
with a proof of superlinear convergence. This technique is shown to provide
an effective algorithm for computing the extreme eigenvalues of a symmetric
matrix. The conjugate gradient method is applied to the function tr Θ^T QΘN.
2 Preliminaries
This paper is concerned with the following problem: compute the extrema of a C∞ function f defined on a complete Riemannian manifold M.
There are many well-known algorithms for solving this problem in the case
where M is a Euclidean space. This paper generalizes several of these algorithms
to the case of complete Riemannian manifolds by replacing the Euclidean no-
tions of straight lines and ordinary differentiation with geodesics and covariant
differentiation. These concepts are reviewed in the following paragraphs. We
follow Helgason’s [24] and Spivak’s [46] treatments of covariant differentiation,
the exponential map, and parallel translation. Details may be found in these
references.
Let M be a complete n-dimensional Riemannian manifold with Riemannian
structure g and corresponding Levi-Civita connection ∇. Denote the tangent
plane at p in M by Tp or Tp M. For every p in M, the Riemannian structure g
provides an inner product on Tp given by the nondegenerate symmetric bilinear
form gp : Tp × Tp → R. The notation ⟨X, Y⟩ = gp(X, Y) and ‖X‖ = gp(X, X)^{1/2},
where X, Y ∈ Tp, is often used. The distance between two points p and q in M is
denoted by d(p, q). The gradient of a real-valued C∞ function f on M at p, denoted
by (grad f)p, is the unique vector in Tp such that dfp(X) = ⟨(grad f)p, X⟩
for all X in Tp.
Denote the set of C∞ functions on M by C∞(M) and the set of C∞ vector
fields on M by X(M). An affine connection on M is a function ∇ which assigns
to each vector field X ∈ X(M) an R-linear map ∇X : X(M) → X(M) which
satisfies

∇fX+gY = f ∇X + g ∇Y,   ∇X(fY) = (Xf)Y + f ∇X Y,

for all f, g ∈ C∞(M) and X, Y ∈ X(M).
3 Steepest Descent on Riemannian Manifolds

Algorithm 3.1 (The method of steepest descent).
Step 0. Select p0 ∈ M, compute G0 = −(grad f)p0, and set i = 0.
Step 1. Compute λi such that f(exppi λi Gi) ≤ f(exppi λGi) for all λ ≥ 0.
Step 2. Set
pi+1 = exppi λi Gi,
Gi+1 = −(grad f)pi+1,
increment i, and go to Step 1.
Remark 3.2 (Taylor's formula). Let f ∈ C∞(M), p ∈ M, X ∈ Tp, and let X̃ be
the vector field adapted to X on a normal neighborhood of p. Then, for λ small
enough,

f(expp λX) = f(p) + λ(∇X̃ f)(p) + ··· + (λ^{n−1}/(n−1)!) (∇^{n−1}_{X̃} f)(p)
  + (λ^n/(n−1)!) ∫₀¹ (1−t)^{n−1} (∇^n_{X̃} f)(expp tλX) dt.   (1)
Furthermore, when n = 1, Eq. (1) applied to the function X̃f = ∇X̃ f yields

(X̃f)(expp λX) = (X̃f)(p) + λ ∫₀¹ (∇²_{X̃} f)(expp tλX) dt.   (3)
where λi is chosen such that f(exp λi Hi) ≤ f(exp λHi) for all λ ≥ 0. Then
there exists a constant E and a θ ∈ [0, 1) such that for all i = 0, 1, . . . ,
Proof. The proof is a generalization of the one given in Polak [36, p. 242ff] for
the method of steepest descent on Euclidean space.
The existence of a convergent sequence is guaranteed by the smoothness of f .
If pj = p̂ for some integer j, the assertion becomes trivial; assume otherwise.
By the smoothness of f, there exists an open neighborhood U of p̂ such that
(∇²f)p is positive definite for all p ∈ U. Therefore, there exist constants k > 0
and K ≥ k > 0 such that for all X ∈ Tp and all p ∈ U,

k‖X‖² ≤ (∇²f)p(X, X) ≤ K‖X‖².   (7)
Next, use (6) with Schwarz’s inequality and the first inequality of (7) to obtain
k d²(pi, p̂) = k‖Xi‖² ≤ ∫₀¹ (∇²_{X̃i} f)(expp̂ tXi) dt = (X̃i f)(pi)
  = dfpi((X̃i)pi) = dfpi(τXi) = ⟨(grad f)pi, τXi⟩
  ≤ ‖(grad f)pi‖ ‖τXi‖ = ‖(grad f)pi‖ d(pi, p̂).
Therefore,
‖(grad f)pi‖ ≥ k d(pi, p̂).   (8)
Using assumption (ii) of the theorem along with (5) we establish for λ ≥ 0
∆(Hi, λ) ≤ −λc‖(grad f)pi‖ ‖Hi‖ + ½λ²K‖Hi‖².   (9)
We may now compute an upper bound for the rate of linear convergence θ.
By assumption (i) of the theorem, λ may be chosen to minimize the right
hand side of (9); this corresponds to choosing λ = c‖(grad f)pi‖/(K‖Hi‖). A
computation reveals that

∆(Hi, λi) ≤ −(c²/2K) ‖(grad f)pi‖².
Applying (7) and (8) to this inequality and rearranging terms yields

f(pi+1) − f(p̂) ≤ θ (f(pi) − f(p̂)),   (10)

where θ = 1 − (ck/K)²; here (7) gives f(pi) − f(p̂) ≤ ½K d²(pi, p̂), and (8)
bounds ‖(grad f)pi‖² below by k²d²(pi, p̂). By assumption, c ∈ (0, 1] and
0 < k ≤ K; therefore θ ∈ [0, 1). (Note that Schwarz’s inequality bounds c
above by unity.)
From (10) it is seen that f(pi) − f(p̂) ≤ Eθ^i, where E = f(p0) − f(p̂). From (7)
we conclude that for i = 0, 1, . . . ,

d(pi, p̂) ≤ (2E/k)^{1/2} θ^{i/2}.   (11)
Corollary 3.4. If Algorithm 3.1 converges to a local minimum, it converges
linearly.
The choice Hi = −(grad f)pi yields c = 1 in the second assumption of
Theorem 3.3, which establishes the corollary.
Example 3.5 (Rayleigh’s quotient on the sphere). Let S^{n−1} be the imbedded
sphere in R^n, i.e., S^{n−1} = { x ∈ R^n : x^T x = 1 }, where x^T y denotes the standard
inner product on R^n, which induces a metric on S^{n−1}. Geodesics on the sphere
are great circles and parallel translation along geodesics is equivalent to rotating
the tangent plane along the great circle. Let x ∈ S^{n−1} and h ∈ Tx have unit
length, and v ∈ Tx be any tangent vector. Then

expx th = x cos t + h sin t,
τh = h cos t − x sin t,
τv = v − (h^T v)(x sin t + h(1 − cos t)),

where τ is the parallelism along the geodesic t ↦ expx th. Let Q be an n-by-n pos-
itive definite symmetric matrix with distinct eigenvalues and define ρ: S^{n−1} → R
by ρ(x) = x^T Qx. A computation shows that

½ (grad ρ)x = Qx − ρ(x)x.   (12)
The function ρ has a unique minimum and maximum point at the eigenvec-
tors corresponding to the smallest and largest eigenvalues of Q, respectively.
Because S^{n−1} is geodesically complete, the method of steepest descent in the
opposite direction of the gradient converges to the eigenvector corresponding
to the smallest eigenvalue of Q; likewise for the eigenvector corresponding to
the largest eigenvalue. Chu [13] considers the continuous limit of this problem.
A computation shows that ρ(x) is maximized along the geodesic expx th
(‖h‖ = 1) when a cos 2t − b sin 2t = 0, where a = 2x^T Qh and b = ρ(x) − ρ(h).
Thus cos t and sin t may be computed with simple algebraic functions of a and b
(which appear below in Algorithm 5.5). The results of a numerical experiment
demonstrating the convergence of the method of steepest descent applied to
maximizing Rayleigh’s quotient on S^{20} are shown in Figure 1.
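To make this example concrete, here is a minimal numerical sketch (Python with NumPy; not part of the original text) of steepest ascent for Rayleigh’s quotient on S^{n−1}, using the gradient (12), great-circle geodesics, and the exact step from a cos 2t − b sin 2t = 0; names, tolerances, and the test matrix are illustrative.

```python
import numpy as np

def steepest_ascent_rayleigh(Q, x0, iters=100, tol=1e-12):
    """Steepest ascent for rho(x) = x^T Q x on the sphere S^{n-1},
    following great-circle geodesics as in Example 3.5 (a sketch)."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        rho = x @ Q @ x
        g = 2 * (Q @ x - rho * x)          # (grad rho)_x, from Eq. (12)
        if np.linalg.norm(g) < tol:
            break
        h = g / np.linalg.norm(g)          # unit ascent direction in T_x
        a = 2 * (x @ Q @ h)
        b = rho - h @ Q @ h                # b = rho(x) - rho(h)
        t = 0.5 * np.arctan2(a, b)         # maximizer of rho along exp_x(t h)
        x = x * np.cos(t) + h * np.sin(t)  # geodesic step: exp_x(t h)
    return x

# Usage: x converges to an eigenvector for the largest eigenvalue of Q.
rng = np.random.default_rng(0)
A = rng.standard_normal((21, 21))
Q = A @ A.T                                # positive definite, per Example 3.5
x = steepest_ascent_rayleigh(Q, rng.standard_normal(21))
```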
Example 3.6 (Brockett [9, 10]). Consider the function f(Θ) = tr Θ^T QΘN
on the special orthogonal group SO(n), where Q is a real symmetric matrix
with distinct eigenvalues and N is a real diagonal matrix with distinct diagonal
elements. It will be convenient to identify tangent vectors in TΘ with tangent
vectors in TI ≅ so(n), the tangent plane at the identity, via left translation.
The gradient of f (with respect to the negative Killing form of so(n), scaled
by 1/(n − 2)) at Θ ∈ SO(n) is Θ[H, N], where H = Ad_{Θ^T}(Q) = Θ^T QΘ. The
group SO(n) acts on the set of symmetric matrices by conjugation; the orbit
of Q under the action of SO(n) is an isospectral submanifold of the symmetric
matrices. We seek a Θ̂ such that f(Θ̂) is maximized. This point corresponds to
a diagonal matrix whose diagonal entries are ordered similarly to those of N. A
related example is found in Smith [45], who considers the homogeneous space
of matrices with fixed singular values, and in Chu [14].
The Levi-Civita connection on SO(n) is bi-invariant and invariant with re-
spect to inversion; therefore, geodesics and parallel translation may be computed
via matrix exponentiation of elements in so(n) and left (or right) translation [24,
Ch. II, Ex. 6]. The geodesic emanating from the identity in SO(n) in direction
X ∈ so(n) is given by the formula expI tX = e^{tX}, where the right hand side
denotes ordinary matrix exponentiation. The expense of geodesic minimization
may be avoided if instead one uses Brockett’s estimate [10] for the step size.
Given Ω ∈ so(n), we wish to find t > 0 such that φ(t) = tr Ad_{e^{−tΩ}}(H)N is
maximized. Differentiating φ twice shows that φ′(t) = −tr Ad_{e^{−tΩ}}(adΩ H)N
and φ″(t) = −tr Ad_{e^{−tΩ}}(adΩ H) adΩ N, where adΩ A = [Ω, A]. Hence, φ′(0) =
2 tr HΩN and, by Schwarz’s inequality and the fact that Ad is an isometry,
|φ″(t)| ≤ ‖adΩ H‖ ‖adΩ N‖. We conclude that if φ′(0) > 0, then φ′ is nonneg-
ative on the interval

0 ≤ t ≤ 2 tr HΩN / (‖adΩ H‖ ‖adΩ N‖),   (13)
which provides an estimate for the step size of Step 1 in Algorithm 3.1. The
results of a numerical experiment demonstrating the convergence of the method
of steepest descent (ascent) in SO(20) using this estimate are shown in Figure 2.
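The following sketch (Python with NumPy/SciPy; not part of the original text) implements this steepest ascent on SO(n), computing geodesics by matrix exponentiation and taking the step size at the endpoint of the interval (13); iteration counts and tolerances are illustrative.

```python
import numpy as np
from scipy.linalg import expm

def steepest_ascent_so_n(Q, N, iters=200, tol=1e-12):
    """Steepest ascent for f(Theta) = tr(Theta^T Q Theta N) on SO(n),
    as in Example 3.6 (a sketch). Geodesic steps are Theta @ expm(t*Omega)
    with Omega = [H, N] in so(n), and t from Brockett's estimate, Eq. (13)."""
    Theta = np.eye(Q.shape[0])
    for _ in range(iters):
        H = Theta.T @ Q @ Theta       # H = Ad_{Theta^T}(Q)
        Omega = H @ N - N @ H         # ascent direction [H, N], skew-symmetric
        adH = Omega @ H - H @ Omega   # ad_Omega H
        adN = Omega @ N - N @ Omega   # ad_Omega N
        denom = np.linalg.norm(adH) * np.linalg.norm(adN)
        if denom < tol:
            break
        t = 2 * np.trace(H @ Omega @ N) / denom   # endpoint of interval (13)
        Theta = Theta @ expm(t * Omega)           # geodesic step on SO(n)
    return Theta
```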
4 Newton's Method on Riemannian Manifolds

Newton’s method is now generalized to Riemannian manifolds; it may be viewed
as a technique for computing the zeros of a C∞ one-form µ on M. At each
point p, the covariant differential (∇µ)p defines a bilinear form on Tp, with the
forward map defined by X ↦ (∇X µ)p = (∇µ)p(·, X), which is nonsingular
whenever this bilinear form is nondegenerate. The notation (∇µ)p will henceforth
be used for both the bilinear form
defined by the covariant differential of µ evaluated at p and the homomorphism
from Tp to Tp∗ induced by this bilinear form. In case of an isomorphism, the
inverse can be used to compute a point in M where µ vanishes, if such a point
exists. The case µ = df will be of particular interest, in which case ∇µ = ∇²f.
Before expounding on these ideas, we make the following remarks.
Remark 4.1 (The mean value theorem). Let M be a manifold with affine con-
nection ∇, Np a normal neighborhood of p ∈ M, X̃ the vector field on Np
adapted to X ∈ Tp, µ a one-form on Np, and τλ the parallelism with respect
to exp tX for t ∈ [0, λ]. Denote the point exp λX by pλ. Then there exists an
ε > 0 such that for every λ ∈ [0, ε), there is an α ∈ [0, λ] such that

τλ⁻¹ µpλ = µp + λ (∇X̃ µ)pα ∘ τα.
Proof. As in the proof of Remark 3.2, there exists an ε > 0 such that λX ∈ N0
for all λ ∈ [0, ε). The map λ ↦ (τλ⁻¹ µpλ)(A), for any A in Tp, is a C∞ function
on [0, ε) with derivative (d/dt)(τt⁻¹ µpt)(A) = (d/dt) µpt(τt A) = ∇X̃(µpt(τt A)) =
(∇X̃ µ)pt(τt A) + µpt(∇X̃(τt A)) = (∇X̃ µ)pt(τt A), the last equality because τt A is
parallel along the geodesic. The lemma follows from the mean value theorem of
real analysis.
This remark can be generalized in the following way.

Remark 4.2 (Taylor's theorem). Under the hypotheses of Remark 4.1, there is
an α ∈ [0, λ] such that

τλ⁻¹ µpλ = µp + λ(∇X̃ µ)p + ··· + (λ^{n−1}/(n−1)!) (∇^{n−1}_{X̃} µ)p
  + (λ^n/n!) (∇^n_{X̃} µ)pα ∘ τα.   (14)
The remark follows by applying Remark 4.1 and Taylor’s theorem of real
analysis to the function λ ↦ (τλ⁻¹ µpλ)(A) for any A in Tp.
Remarks 4.1 and 4.2 can be generalized to C∞ tensor fields, but we will only
require Remark 4.2 for the case n = 2 to make the following observation.
Let µ be a one-form on M such that for some p̂ in M, µp̂ = 0. Given any p
in a normal neighborhood of p̂, we wish to find X in Tp such that expp X = p̂.
Consider the Taylor expansion of µ about p, and let τ be the parallel translation
along the unique geodesic joining p to p̂. We have, by our assumption that µ
vanishes at p̂ and from Eq. (14) for n = 2,

0 = τ⁻¹µp̂ = µp + (∇µ)p(·, X) + ½ (∇²_{X̃} µ)pα ∘ τα.

If the bilinear form (∇µ)p is nondegenerate, the tangent vector X may be ap-
proximated by discarding the higher order terms and solving the resulting linear
equation

µp + (∇µ)p(·, X) = 0

for X, which yields

X = −(∇µ)p⁻¹ µp.
Algorithm 4.3 (Newton's method).
Step 0. Select p0 ∈ M and set i = 0.
Step 1. Compute
Hi = −(∇µ)pi⁻¹ µpi,
pi+1 = exppi Hi
(assume that (∇µ)pi is nondegenerate), increment i, and repeat.
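In the same generic spirit as the steepest descent sketch above, Algorithm 4.3 might be expressed as follows (Python with NumPy; an assumption-laden sketch in which mu, cov_diff, and exp_map are user-supplied callables representing µ, (∇µ)p, and the exponential map in a chosen basis; they are not part of the original text).

```python
import numpy as np

def riemannian_newton(mu, cov_diff, exp_map, p0, iters=20, tol=1e-12):
    """Sketch of Algorithm 4.3: solve (nabla mu)_{p_i} H_i = -mu_{p_i} for the
    tangent vector H_i and step to exp_{p_i}(H_i)."""
    p = p0
    for _ in range(iters):
        m = mu(p)                              # mu_{p_i}, as a coordinate vector
        if np.linalg.norm(m) < tol:
            break
        H = -np.linalg.solve(cov_diff(p), m)   # H_i = -(nabla mu)_{p_i}^{-1} mu_{p_i}
        p = exp_map(p, H)                      # p_{i+1} = exp_{p_i} H_i
    return p
```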
where τt is the parallel translation from pi to pt = exp tXi. The trivial identities
(∇X̃i df)pi = (∇²f)pi Xi and (∇²_{X̃i} df)pα = (∇³f)pα(τα ·, ταXi, ταXi) will be used
to replace the last two terms on the right hand side of Eq. (17). Combining the
assumption that dfp̂ = 0 with Eqs. (16) and (17), we obtain
where the induced norm on Tp* is used in all three cases. Taking the norm
of both sides of Eq. (18), applying the triangle inequality to the right hand
side, and using the fact that parallel translation is an isometry, we obtain the
inequality

δ′ d(pi+1, p̂) ≤ δ‴ d²(pi, p̂) + δ″ ‖Ξi‖.   (19)
The length of Ξi can be bounded by a cubic expression in d(pi, p̂) by con-
sidering the distance between the points exp(Hi + τ⁻¹Xi+1) and exp Xi+1 = p̂.
Given p ∈ M and ε > 0 small enough, let a, v ∈ Tp be such that ‖a‖ + ‖v‖ ≤ ε,
and let τ be the parallel translation with respect to the geodesic from p to
q = expp a. Karcher [28, App. C2.2] shows that
where K is the sectional curvature of M along any section in the tangent plane
at any point near p.
There exists a constant c > 0 such that ‖Ξi‖ ≤ c d(p̂, exp(Hi + τ⁻¹Xi+1)).
By (20), we have ‖Ξi‖ ≤ const. ‖Hi‖². Taking the norm of both sides of the
Taylor formula dfpi = −∫₀¹ (∇X̃i df)(exp tXi) dt and applying a standard integral
inequality and inequality (ii) from above yields ‖dfpi‖ ≤ δ″‖Xi‖, so that ‖Hi‖ ≤
const. ‖Xi‖. Furthermore, we have the triangle inequality ‖Xi+1‖ ≤ ‖Xi‖ +
‖Hi‖; therefore ε may be chosen such that ‖Hi‖ + ‖Xi+1‖ ≤ ε ≤ const. ‖Xi‖.
By (20) there exists δ^{iv} > 0 such that ‖Ξi‖ ≤ δ^{iv} d³(pi, p̂).
Corollary 4.5. If (∇²f)p̂ is positive (negative) definite and Algorithm 4.3
converges to p̂, then Algorithm 4.3 converges quadratically to a local minimum
(maximum) of f.
Example 4.6 (Rayleigh’s quotient on the sphere). Let S^{n−1} and ρ(x) = x^T Qx
be as in Example 3.5. It will be convenient to work with the coordinates x1, . . . ,
xn of the ambient space R^n, treat the tangent plane Tx S^{n−1} as a vector subspace
of R^n, and make the identification Tx S^{n−1} ≅ Tx* S^{n−1} via the metric. In this
coordinate system, geodesics on the sphere obey the second order differential
equation ẍk + xk = 0, k = 1, . . . , n. Thus the Christoffel symbols are given
by Γ^k_{ij} = δij xk, where δij is the Kronecker delta. The ijth component of the
second covariant differential of ρ at x in S^{n−1} is given by (cf. Eq. (4))

((∇²ρ)x)ij = 2Qij − δij Σ_{k,l} xk · 2Qkl xl = 2(Qij − ρ(x)δij),
u = A⁻¹v − ((x^T A⁻¹v)/(x^T A⁻¹x)) A⁻¹x.   (22)
For Newton’s method, the direction Hi in Tx S^{n−1} is the solution of the equation
Proof 2. The proof follows Parlett’s [34, p. 72ff] proof of cubic convergence
for the Rayleigh quotient iteration. Assume that for all i, xi ≠ x̂, and denote
ρ(xi) by ρi. For all i, there is an angle ψi and a unit length vector ui defined
by the equation xi = x̂ cos ψi + ui sin ψi, such that x̂^T ui = 0. By Algorithm 4.7,

xi+1 = x̂ cos ψi+1 + ui+1 sin ψi+1 = xi cos θi + Hi sin θi/θi
  = x̂ (αi sin θi/((λ − ρi)θi) + βi) cos ψi + ((αi sin θi/θi)(Q − ρi I)⁻¹ui + βi ui) sin ψi,

where βi = cos θi − sin θi/θi. Therefore,

|tan ψi+1| = ‖(αi sin θi/θi)(Q − ρi I)⁻¹ui + βi ui‖ / |αi sin θi/((λ − ρi)θi) + βi| · |tan ψi|.   (23)
The following equalities and low order approximations in terms of the small
quantities λ − ρi, θi, and ψi are straightforward to establish: λ − ρi =
(λ − ρ(ui)) sin² ψi, θi² = cos² ψi sin² ψi + h.o.t., αi = (λ − ρi) + h.o.t., and
βi = −θi²/3 + h.o.t. Thus, the denominator of the large fraction in Eq. (23) is
of order unity and the numerator is of order sin² ψi. Therefore, we have

|ψi+1| = const. |ψi|³ + h.o.t.
Remark 4.9. If Algorithm 4.7 is simplified by replacing Step 2 with

Step 2′. Compute xi+1 = yi/‖yi‖, increment i, and go to Step 1,
then we obtain the Rayleigh quotient iteration. These two algorithms differ by
the method in which they use the vector yi = (Q − ρ(xi)I)⁻¹xi to compute the
next iterate on the sphere. Algorithm 4.7 computes the point Hi in Txi S^{n−1}
where yi intersects this tangent plane, then computes xi+1 via the exponential
map of this vector (which “rolls” the tangent vector Hi onto the sphere). The
Rayleigh quotient iteration computes the intersection of yi with the sphere itself
and takes this intersection to be xi+1 . The latter approach approximates Algo-
rithm 4.7 up to quadratic terms when xi is close to an eigenvector. Algorithm 4.7
is more expensive to compute than—though of the same order as—the Rayleigh
quotient iteration; thus, the RQI is seen to be an efficient approximation of
Newton’s method.
If the exponential map is replaced by the chart v ∈ Tx ↦ (x + v)/‖x + v‖ ∈
S^{n−1}, Shub [39] shows that a corresponding version of Newton’s method is
equivalent to the RQI.
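The two iterations compared in Remark 4.9 can be sketched as follows (Python with NumPy; not part of the original text). The Newton iterate intersects yi = (Q − ρ(xi)I)⁻¹xi with the tangent plane at xi and rolls the resulting tangent vector onto the sphere; the RQI simply normalizes yi. As with the RQI, the linear solve becomes nearly singular close to convergence; in practice a few iterations suffice.

```python
import numpy as np

def newton_rayleigh(Q, x0, iters=10):
    """Newton's method for Rayleigh's quotient on S^{n-1}: a sketch of the
    iteration described in Remark 4.9."""
    x = x0 / np.linalg.norm(x0)
    I = np.eye(len(x))
    for _ in range(iters):
        y = np.linalg.solve(Q - (x @ Q @ x) * I, x)  # y_i = (Q - rho_i I)^{-1} x_i
        H = y / (x @ y) - x      # H_i: where the line through y_i meets the tangent plane
        t = np.linalg.norm(H)
        if t < 1e-15:
            break
        x = x * np.cos(t) + (H / t) * np.sin(t)      # exp_{x_i}(H_i): roll onto sphere
    return x

def rqi(Q, x0, iters=10):
    """Rayleigh quotient iteration: intersect y_i with the sphere itself,
    i.e., normalize; approximates the Newton iterate to quadratic terms."""
    x = x0 / np.linalg.norm(x0)
    I = np.eye(len(x))
    for _ in range(iters):
        y = np.linalg.solve(Q - (x @ Q @ x) * I, x)
        x = y / np.linalg.norm(y)
    return x
```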
Example 4.10 (The function tr Θ^T QΘN). Let Θ, Q, H = Ad_{Θ^T}(Q), and Ω be
as in Example 3.6. The second covariant differential of f(Θ) = tr Θ^T QΘN may
be computed either by polarization of the second order term of tr Ad_{e^{−tΩ}}(H)N,
or by covariant differentiation of the differential dfΘ = −tr [H, N] Θ^T(·):

(∇²f)Θ(ΘX, ΘY) = −½ tr([H, adX N] − [adX H, N]) Y,

If the hi are close to ανi, α ∈ R, for all i, then (∇³f)Θ(·, ΘX, ΘX) may be
small, yielding a fast rate of quadratic convergence.
5 The Conjugate Gradient Method on Riemannian Manifolds

Therefore, the numerator of the right hand side of Eq. (26) multiplied by the
step size λi can be approximated by the equation

λi (∇²f)pi+1(τHi, Gi+1) = dfpi+1(Gi+1) − (τ dfpi)(Gi+1)
  = −⟨Gi+1 − τGi, Gi+1⟩,

because, by definition, Gi = −(grad f)pi, i = 0, 1, . . . , and for any X in Tpi+1,
(τ dfpi)(X) = dfpi(τ⁻¹X) = ⟨(grad f)pi, τ⁻¹X⟩ = ⟨τ(grad f)pi, X⟩. Similarly,
the denominator of the right hand side of Eq. (26) multiplied by λi can be
approximated by the equation

λi (∇²f)pi+1(τHi, τHi) = dfpi+1(τHi) − (τ dfpi)(τHi)
  = ⟨Gi, Hi⟩

(the first term vanishes because λi exactly minimizes f along the geodesic),
so that Eq. (26) becomes

γi = ⟨Gi+1 − τGi, Gi+1⟩ / ⟨Gi, Hi⟩.   (27)
Algorithm 5.2 (The conjugate gradient method).
Step 0. Select p0 ∈ M, compute G0 = H0 = −(grad f)p0, and set i = 0.
Step 1. Compute λi such that f(exppi λi Hi) ≤ f(exppi λHi) for all λ ≥ 0.
Step 2. Set pi+1 = exppi λi Hi.
Step 3. Set
Gi+1 = −(grad f)pi+1,
Hi+1 = Gi+1 + γi τHi,
where γi is given by Eq. (27) and τ is the parallel translation along the geodesic
from pi to pi+1. If i ≡ n − 1 (mod n), set Hi+1 = Gi+1. Increment i, and go
to Step 1.
Let ν be a normal coordinate chart centered at p̂, and let X1, . . . , Xn ∈ Tp̂
be the orthonormal basis corresponding to its coordinate directions. By the
smoothness of f and exp, ν∗f has a critical point at 0 ∈ R^n such that the Hes-
sian matrix of ν∗f at 0 is positive definite. Indeed, by the fact that (d exp)0 = id,
the ijth component of the Hessian matrix of ν∗f at 0 is given by (d²f)p̂(Xi, Xj).
Therefore, there exists a neighborhood U of 0 ∈ R^n, a constant θ′ > 0, and
an integer N, such that for any initial point x0 ∈ U, the conjugate gradient
method on Euclidean space (with resets) applied to the function ν∗f yields a
sequence of points xi converging to 0 such that for all i ≥ N,

‖xi+n‖ ≤ θ′ ‖xi‖².
See Polak [36, p. 260ff] for a proof of this fact. Let x0 = ν(p0 ) in U be an initial
point. Because exp is not an isometry, Algorithm 5.2 yields a different sequence
of points in Rn than the classical conjugate gradient method on Rn (upon
equating points in a neighborhood of p̂ ∈ M with points in a neighborhood
of 0 ∈ Rn via the normal coordinates).
Nevertheless, the amount by which exp fails to preserve inner products can
be quantified via the Gauss Lemma and Jacobi’s equation; see, e.g., Cheeger
and Ebin [12], or the appendices of Karcher [28]. Let t be small, and let X ∈ Tp̂
and Y ∈ TtX(Tp̂) ≅ Tp̂ be orthonormal tangent vectors. The amount by which
the exponential map changes the length of tangent vectors is approximated by
the Taylor expansion

‖d exp(tY)‖² = t² − ⅓Kt⁴ + h.o.t.,

where K is the sectional curvature of M along the section in Tp̂ spanned by
X and Y.
X and Y . Therefore, near p̂ Algorithm 5.2 differs from the conjugate gradient
method on Rn applied to the function ν∗f only by third order and higher terms.
Thus both algorithms have the same rate of convergence. The theorem follows.
Algorithm 5.5 (Conjugate gradient method for the extreme eigenvalues of a
symmetric matrix).
Step 0. Select x0 ∈ S^{n−1}, compute G0 = H0 = (Q − ρ(x0)I)x0, and set i = 0.
Step 1. Compute λi such that ρ(expxi λi Hi) ≥ ρ(expxi λHi) for all λ ≥ 0 (cf.
the step-size computation of Example 3.5).
Step 2. Set xi+1 = expxi λi Hi.
Step 3. Set
Gi+1 = (Q − ρ(xi+1)I) xi+1,
Hi+1 = Gi+1 + γi τHi,   γi = ((Gi+1 − τGi)^T Gi+1)/(Gi^T Hi),
where τ is the parallel translation along the geodesic from xi to xi+1.
If i ≡ n − 1 (mod n), set Hi+1 = Gi+1. Increment i, and go to Step 1.
Fuhrmann and Liu [20] provide a conjugate gradient algorithm for Rayleigh’s
quotient on the sphere that uses an azimuthal projection onto tangent planes.
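A numerical sketch of Algorithm 5.5 (Python with NumPy; not from the original text) combines the geodesic step of Example 3.5, its parallel translation formulas, and the coefficient (27); tolerances are illustrative, and the reset schedule follows the text.

```python
import numpy as np

def cg_rayleigh(Q, x0, iters=200, tol=1e-12):
    """Conjugate gradient for Rayleigh's quotient on S^{n-1}, in the spirit of
    Algorithm 5.5 (a sketch)."""
    n = len(x0)
    x = x0 / np.linalg.norm(x0)
    G = Q @ x - (x @ Q @ x) * x              # G_0 = (Q - rho(x_0) I) x_0
    H = G.copy()
    for i in range(iters):
        nH = np.linalg.norm(H)
        if nH < tol:
            break
        h = H / nH
        rho = x @ Q @ x
        a = 2 * (x @ Q @ h)
        b = rho - h @ Q @ h
        t = 0.5 * np.arctan2(a, b)           # Step 1: maximize rho along the great circle
        c, s = np.cos(t), np.sin(t)
        x_new = x * c + h * s                # Step 2: x_{i+1} = exp_{x_i}(t h)
        tau = lambda v: v - (h @ v) * (x * s + h * (1 - c))   # parallel translation
        G_new = Q @ x_new - (x_new @ Q @ x_new) * x_new       # Step 3: new gradient
        gamma = ((G_new - tau(G)) @ G_new) / (G @ H)          # Eq. (27)
        H = G_new + gamma * tau(H)
        if i % n == n - 1:                   # reset direction every n steps
            H = G_new
        x, G = x_new, G_new
    return x
```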
[Figure 2 here: log-scale plot of ‖Hi − Di‖ (10⁻⁵ to 10¹) versus step i (0 to 140), comparing the method of steepest descent, the conjugate gradient method, and Newton's method.]
Figure 2: Maximization of tr Θ^T QΘN on SO(20) (dim SO(20) = 190), where
N = diag(20, . . . , 1). The ith iterate is Hi = Θi^T QΘi, Di is the diagonal matrix
of eigenvalues of Hi, H0 is near N, and ‖·‖ is the norm induced by the standard
inner product on gl(n). Geodesics and parallel translation were computed using the
algorithm of Ward and Gray [47, 48]; the step sizes for the method of steepest descent
and the conjugate gradient method were computed using Brockett’s estimate [10].
by g. Brockett’s estimate (n.b. Eq. (13)) for the step size may be used in Step 1
of Algorithm 5.2. The results of a numerical experiment demonstrating the
convergence of the conjugate gradient method in SO(20) are shown in Figure 2.
References
[1] Bertsekas, D. P. Projected Newton methods for optimization problems with simple
constraints, SIAM J. Cont. Opt. 20 : 221–246, 1982.
[2] ———. Constrained Optimization and Lagrange Multiplier Methods. New York: Aca-
demic Press, 1982.
[3] Bloch, A. M., Brockett, R. W., and Ratiu, T. S. A new formulation of the generalized
Toda lattice equations and their fixed point analysis via the momentum map, Bull. Amer.
Math. Soc. 23 (2) : 477–485, 1990.
[4] ———. Completely integrable gradient flows, Commun. Math. Phys. 147 : 57–74, 1992.
[5] Botsaris, C. A. Differential gradient methods, J. Math. Anal. Appl. 63 : 177–198, 1978.
[6] ———. A class of differential descent methods for constrained optimization, J. Math.
Anal. Appl. 79 : 96–112, 1981.
[7] ———. Constrained optimization along geodesics, J. Math. Anal. Appl. 79 : 295–306,
1981.
[8] Brockett, R. W. Least squares matching problems, Lin. Alg. Appl. 122/123/124 :
761–777, 1989.
[9] ———. Dynamical systems that sort lists, diagonalize matrices, and solve linear pro-
gramming problems, Lin. Alg. Appl. 146 : 79–91, 1991.
[10] ———. Differential geometry and the design of gradient algorithms, Proc. Symp. Pure
Math. R. Green and S. T. Yau, eds. Providence, RI: Amer. Math. Soc., to appear.
[11] Brown, A. A. and Bartholomew-Biggs, M. C. Some effective methods for uncon-
strained optimization based on the solution of systems of ordinary differential equations,
J. Optim. Theory Appl. 62 (2) : 211–224, 1989.
[12] Cheeger, J. and Ebin, D. G. Comparison Theorems in Riemannian Geometry. Ams-
terdam: North-Holland Publishing Company, 1975.
[13] Chu, M. T. Curves on S^{n−1} that lead to eigenvalues or their means of a matrix, SIAM
J. Alg. Disc. Meth. 7 (3) : 425–432, 1986.
[14] Chu, M. T. and Driessel, K. The projected gradient method for least squares matrix ap-
proximations with spectral constraints, SIAM J. Numer. Anal. 27 (4) : 1050–1060, 1990.
[15] Dunn, J. C. Newton’s method and the Goldstein step length rule for constrained mini-
mization problems, SIAM J. Cont. Opt. 18 : 659–674, 1980.
[16] ———. Global and asymptotic convergence rate estimates for a class of projected gra-
dient processes, SIAM J. Cont. Opt. 19 : 368–400, 1981.
[17] Faybusovich, L. Hamiltonian structure of dynamical systems which solve linear pro-
gramming problems, Phys. D 53 : 217–232, 1991.
[18] Fletcher, R. Practical Methods of Optimization, 2d ed. New York: Wiley & Sons, 1987.
[19] Fletcher, R. and Reeves, C. M. Function minimization by conjugate gradients, Com-
put. J. 7 (2) : 149–154, 1964.
[20] Fuhrmann, D. R. and Liu, B. An iterative algorithm for locating the minimal eigenvector
of a symmetric matrix, Proc. IEEE ICASSP 84 pp. 45.8.1–4, 1984.
[21] Gill, P. E. and Murray, W. Newton-type methods for linearly constrained optimiza-
tion, in Numerical Methods for Constrained Optimization. P. E. Gill and W. Murray,
eds. London: Academic Press, Inc., 1974.
[22] Golub, G. H. and Van Loan, C. Matrix Computations. Baltimore, MD: Johns Hopkins
University Press, 1983.
[23] Golubitsky, M. and Guillemin, V. Stable Mappings and Their Singularities. New York:
Springer-Verlag, 1973.
[24] Helgason, S. Differential Geometry, Lie Groups, and Symmetric Spaces. New York:
Academic Press, 1978.
[25] Helmke, U. Isospectral flows on symmetric matrices and the Riccati equation, Systems &
Control Lett. 16 : 159–165, 1991.
[26] Hestenes, M. R. and Stiefel, E. Methods of conjugate gradients for solving linear
systems, J. Res. Nat. Bur. Stand. 49 : 409–436, 1952.
[27] Hirsch, M. W. and Smale, S. On algorithms for solving f (x) = 0, Comm. Pure Appl.
Math. 32 : 281–312, 1979.
[28] Karcher, H. Riemannian center of mass and mollifier smoothing, Comm. Pure Appl.
Math. 30 : 509–541, 1977.
[29] Kobayashi, S. and Nomizu, K. Foundations of Differential Geometry, Vol. 2. New York:
Wiley Interscience Publishers, 1969.
[30] Lagarias, J. C. Monotonicity properties of the Toda flow, the QR-flow, and subspace
iteration, SIAM J. Matrix Anal. Appl. 12 (3) : 449–462, 1991.
[31] Luenberger, D. G. Introduction to Linear and Nonlinear Programming. Reading, MA:
Addison-Wesley, 1973.
[32] Moler, C. and Van Loan, C. Nineteen dubious ways to compute the exponential of a
matrix, SIAM Rev. 20 (4) : 801–836, 1978.
[34] Parlett, B. The Symmetric Eigenvalue Problem. Englewood Cliffs, NJ: Prentice-Hall,
1980.
[35] Perkins, J. E., Helmke, U., and Moore, J. B. Balanced realizations via gradient flow
techniques, Systems & Control Lett. 14 : 369–380, 1990.
[36] Polak, E. Computational Methods in Optimization. New York: Academic Press, 1971.
[37] Rudin, W. Principles of Mathematical Analysis, 3d ed. New York: McGraw-Hill, 1976.
[38] Sargent, R. W. H. Reduced gradient and projection methods for nonlinear program-
ming, in Numerical Methods for Constrained Optimization. P. E. Gill and W. Murray,
eds. London: Academic Press, Inc., 1974.
[39] Shub, M. Some remarks on dynamical systems and numerical analysis, in Dynamical
Systems and Partial Differential Equations: Proc. VII ELAM. L. Lara-Carrero and
J. Lewowicz, eds. Caracas: Equinoccio, U. Simón Bolívar, pp. 69–92, 1986.
[43] Smale, S. The fundamental theorem of algebra and computational complexity, Bull.
Amer. Math. Soc. 4 (1) : 1–36, 1981.
[44] ———. On the efficiency of algorithms in analysis, Bull. Amer. Math. Soc. 13 (2) : 87–
121, 1985.
[45] Smith, S. T. Dynamical systems that perform the singular value decomposition, Sys-
tems & Control Lett. 16 : 319–327, 1991.
[48] Ward, R. C. and Gray, L. J. Algorithm 530: An algorithm for computing the eigensys-
tem of skew-symmetric matrices and a class of symmetric matrices, ACM Trans. Math.
Softw. 4 (3) : 286–289, 1978. See also Collected Algorithms from ACM, Vol. 3. New York:
Assoc. Comput. Mach., 1978.