Interior Gradient and Proximal Methods For Convex and Conic Optimization
ALFRED AUSLENDER AND MARC TEBOULLE
Abstract. Interior gradient (subgradient) and proximal methods for convex constrained minimization have been much studied, in particular for optimization problems over the nonnegative octant. These methods use non-Euclidean projections and proximal distance functions to exploit the geometry of the constraints. In this paper, we identify a simple mechanism that allows us to derive global convergence results for the produced iterates as well as improved global rate of convergence estimates for a wide class of such methods, and with more general convex constraints. Our results are illustrated with many applications and examples, including some new explicit and simple algorithms for conic optimization problems. In particular, we derive a class of interior gradient algorithms which exhibits an $O(k^{-2})$ global convergence rate estimate.
Key words. convex optimization, interior gradient/subgradient algorithms, proximal distances, conic optimization, convergence and efficiency
AMS subject classifications. 90C25, 90C30, 90C22
DOI. 10.1137/S1052623403427823
1. Introduction. Consider the following convex minimization problem:
(P) $f_* = \inf\{ f(x) \mid x \in \overline{C} \}$,
where $\overline{C}$ denotes the closure of $C$, a nonempty convex open set in $\mathbb{R}^n$, and $f : \mathbb{R}^n$
produced by either one of the above algorithms does not necessarily belong to C.
In this paper, the proximal term d(x, y) will play the role of a distance-like function
satisfying certain desirable properties (see section 2), which will force the iterates of
the produced sequence to stay in C, and thus automatically eliminate the constraints.
Received by the editors May 13, 2003; accepted for publication (in revised form) June 4, 2005;
published electronically January 6, 2006.
http://www.siam.org/journals/siopt/16-3/42782.html
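$\partial_\varepsilon F(x) = \{ g \in \mathbb{R}^n \mid \forall z \in \mathbb{R}^n,\ F(z) + \varepsilon \ge F(x) + \langle g, z - x \rangle \}$, which coincides with the usual subdifferential $\partial F \equiv \partial_0 F$ whenever $\varepsilon = 0$. We set $\operatorname{dom} \partial F = \{ x \in \mathbb{R}^n \mid \partial F(x) \ne \emptyset \}$. For any closed convex set $S \subset \mathbb{R}^n$, $\delta_S$ denotes the indicator function of $S$, $\operatorname{ri} S$ its relative interior, and
$N_S(x) = \partial \delta_S(x) = \{ \nu \in \mathbb{R}^n \mid \langle \nu, z - x \rangle \le 0\ \ \forall z \in S \}$
the normal cone to $S$ at $x \in S$. The set of $n$-vectors with nonnegative (positive) components is denoted by $\mathbb{R}^n_+$ ($\mathbb{R}^n_{++}$).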
2. A general framework for interior proximal methods. Let $C$ be a nonempty convex open set in $\mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ a proper, lsc, and convex function. Consider the optimization problem
(P) $f_* = \inf\{ f(x) \mid x \in \overline{C} \}$,
where $\overline{C}$ denotes the closure of $C$. Unless otherwise specified, throughout this paper we make the following standing assumptions on (P):
(a) $\operatorname{dom} f \cap C \ne \emptyset$,
(b) $-\infty < f_*$.
We study the behavior of the following basic proximal iterative scheme to solve (P):
$x^k \in \operatorname{argmin}\{ \lambda_k f(x) + d(x, x^{k-1}) \mid x \in \overline{C} \}$, $k = 1, 2, \ldots$ ($\lambda_k > 0$),
where $d$ is some proximal distance. Our approach is motivated by and patterned after many of the studies mentioned in the introduction, and our objective is to develop a general framework to analyze the convergence of the resulting methods under various settings. Given the optimization problem (P), essentially the basic ingredients needed to achieve the aforementioned goals are
• to pick an appropriate proximal distance $d$ which allows us to eliminate the constraints,
700 ALFRED AUSLENDER AND MARC TEBOULLE
• given $d$, to find an induced proximal distance $H$, which will control the behavior of the resulting method.
We begin by defining an appropriate proximal distance $d$ for problem (P).
Definition 2.1. A function $d : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}_+ \cup \{+\infty\}$ is called a proximal distance with respect to an open nonempty convex set $C \subset \mathbb{R}^n$ if for each $y \in C$ it satisfies the following properties:
(P1) $d(\cdot, y)$ is proper, lsc, convex, and $C^1$ on $C$;
(P2) $\operatorname{dom} d(\cdot, y) \subset \overline{C}$ and $\operatorname{dom} \partial_1 d(\cdot, y) = C$, where $\partial_1 d(\cdot, y)$ denotes the subgradient map of the function $d(\cdot, y)$ with respect to the first variable;
(P3) $d(\cdot, y)$ is level bounded on $\mathbb{R}^n$, i.e., $\lim_{\|u\| \to \infty} d(u, y) = +\infty$;
(P4) $d(y, y) = 0$.
We denote by $\mathcal{D}(C)$ the family of functions $d$ satisfying Definition 2.1. Property (P1) is needed to preserve convexity of $d(\cdot, y)$, (P2) will force the iterate $x^k$ to stay in $C$, and (P3) is used to guarantee the existence of such an iterate. For each $y \in C$, let $\nabla_1 d(\cdot, y)$ denote the gradient map of the function $d(\cdot, y)$ with respect to the first variable. Note that by definition $d(\cdot, \cdot) \ge 0$, and from (P4) the global minimum of $d(\cdot, y)$ is obtained at $y$, which shows that $\nabla_1 d(y, y) = 0$.
Proposition 2.1. Let $d \in \mathcal{D}(C)$, and for all $y \in C$ consider the optimization problem
P(y) f
(y) + . (2.2)
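Proof. We set $t(u) = f(u) + d(u, y) + \delta_{\overline{C}}(u)$. Then by (P2) we have $f_*(y) = \inf\{ t(u) \mid u \in \mathbb{R}^n \}$. Furthermore, since $f_*$ is finite, it follows by (P3) that $t(\cdot)$ is level bounded. Therefore with $t(\cdot)$ being a proper, lsc convex function, it follows that $S(y)$ is nonempty and compact. From the optimality conditions, for each $u(y) \in S(y)$ we have $0 \in \partial t(u(y))$. Now, since $\operatorname{dom} f \cap C \ne \emptyset$ and $C$ is open, we can apply [39, Theorem 23.8] so that
$\partial t(u) = \partial f(u) + \partial_1 d(u, y) + N_{\overline{C}}(u)$ $\forall u$.
Since $\operatorname{dom} \partial_1 d(\cdot, y) = C$, it follows that $u(y) \in C$, and hence $N_{\overline{C}}(u(y)) = \{0\}$, and (2.1) holds for $\varepsilon = 0$ with $g \in \partial f(u(y))$. For $\varepsilon > 0$, (2.1) holds for such a pair $(u(y), g)$ since $\partial f(u(y)) \subset \partial_\varepsilon f(u(y))$, and we have
$f(u) + d(u, y) \ge f(u(y)) + d(u(y), y) + \langle g + \nabla_1 d(u(y), y), u - u(y) \rangle - \varepsilon$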
so that f
k
f(x
k
) (2.3)
such that
$\lambda_k g^k + \nabla_1 d(x^k, x^{k-1}) = 0$. (2.4)
The IPA can be viewed as an approximate interior proximal method when $\varepsilon_k > 0$ $\forall k \in \mathbb{N}$ (the set of natural numbers), which becomes exact for the special case $\varepsilon_k = 0$ $\forall k \in \mathbb{N}$.
The next step is to associate with each given $d \in \mathcal{D}(C)$ a corresponding proximal distance satisfying some desirable properties needed to analyze the IPA.
Definition 2.2. Given $C \subset \mathbb{R}^n$, open and convex, and $d \in \mathcal{D}(C)$, a function $H : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}_+ \cup \{+\infty\}$ is called the induced proximal distance to $d$ if $H$ is finite valued on $C \times C$ and for each $a, b \in C$ satisfies
$H(a, a) = 0$, (2.5)
$\langle c - b, \nabla_1 d(b, a) \rangle \le H(c, a) - H(c, b)$ $\forall c \in C$. (2.6)
We write $(d, H) \in \mathcal{F}(C)$ to quantify the triple $[C, d, H]$ that satisfies the premises of Definition 2.2.
Likewise, we will write $(d, H) \in \mathcal{F}(\overline{C})$ for the triple $[\overline{C}, d, H]$ whenever there exists $H$ which is finite valued on $\overline{C} \times C$, satisfies (2.5)–(2.6) for any $c \in \overline{C}$, and is such that $\forall c \in \overline{C}$ one has $H(c, \cdot)$ level bounded on $C$. Clearly, one has $\mathcal{F}(\overline{C}) \subset \mathcal{F}(C)$.
The motivation behind such a construction is not as mysterious as it might look at first sight. Indeed, for the moment, notice that the classical PA, which corresponds to the special case $C = \overline{C} = \mathbb{R}^n$, $d(x, y) = 2^{-1}\|x - y\|^2$ and the induced proximal distance $H$ being exactly $d$, clearly satisfies (2.6), thanks to the well-known identity
$\|z - x\|^2 = \|z - y\|^2 + \|y - x\|^2 + 2\langle z - y, y - x \rangle$.
IPA with $d \equiv H$ will be called self-proximal. Several useful examples of more general self-proximal methods for various classes of constraint sets $C$ will be given in the next section.
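As a small numerical illustration of the classical (Euclidean) case just described, the following sketch checks that the slack in inequality (2.6), with $H = d = 2^{-1}\|x - y\|^2$, equals $d(b, a)$ and is therefore nonnegative. The function names and the random test points are assumptions made for this illustration only.

```python
import numpy as np

def d_euclid(x, y):
    """Classical proximal distance d(x, y) = 0.5 * ||x - y||^2."""
    return 0.5 * float(np.dot(x - y, x - y))

def grad1_d_euclid(b, a):
    """Gradient of d(., a) with respect to its first argument, evaluated at b."""
    return b - a

def slack_2_6(c, b, a):
    """H(c, a) - H(c, b) - <c - b, grad_1 d(b, a)> with H = d; (2.6) requires >= 0."""
    inner = float(np.dot(c - b, grad1_d_euclid(b, a)))
    return d_euclid(c, a) - d_euclid(c, b) - inner

rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 5))   # three arbitrary points in R^5
slack = slack_2_6(c, b, a)
```

By the identity above, `slack` coincides with `d_euclid(b, a)` up to rounding, which is the self-proximal situation $d \equiv H$.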
As we shall see below, the requested properties for the function $H$ associated with $d$ naturally emerge from the analysis of the classical PA as given in [24] and later extended for various specific classes of IPA in [16, 44, 5]. Building on these works, we can already easily obtain global rates of convergence estimates as well as convergence in limit points of the sequence produced by IPA. To derive the global convergence of the sequence $\{x^k\}$ to an optimal solution of (P), additional assumptions on the induced proximal distance $H$, akin to the properties of norms, will be required.
Before giving our convergence results, we recall the following well-known properties on nonnegative sequences, which will be useful to us throughout this work.
Lemma 2.1 (see [35]). Let $\{v_k\}$, $\{\gamma_k\}$, and $\{\beta_k\}$ be nonnegative sequences of real numbers satisfying $v_{k+1} \le (1 + \gamma_k) v_k + \beta_k$ and such that $\sum_{k=1}^{\infty} \gamma_k < \infty$, $\sum_{k=1}^{\infty} \beta_k < \infty$. Then, the sequence $\{v_k\}$ converges.
Lemma 2.2 (see [35]). Let $\{\lambda_k\}$ be a sequence of positive numbers, $\{a_k\}$ a sequence of real numbers, and $b_n := \sigma_n^{-1} \sum_{k=1}^{n} \lambda_k a_k$, where $\sigma_n = \sum_{k=1}^{n} \lambda_k$. If $\sigma_n \to \infty$, one has
(i) $\liminf a_n \le \liminf b_n \le \limsup b_n \le \limsup a_n$,
(ii) $\lim b_n = a$ whenever $\lim a_n = a$.
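Theorem 2.1. Let $(d, H) \in \mathcal{F}(C)$ and let $\{x^k\}$ be the sequence generated by IPA. Set $\sigma_n = \sum_{k=1}^{n} \lambda_k$. Then the following hold:
(i) $f(x^n) - f(x) \le \sigma_n^{-1} H(x, x^0) + \sigma_n^{-1} \sum_{k=1}^{n} \sigma_k \varepsilon_k$ $\forall x \in C$.
(ii) If $\lim_{n \to \infty} \sigma_n = +\infty$ and $\varepsilon_k \to 0$, then $\liminf_{n \to \infty} f(x^n) = f_*$ and the sequence $\{f(x^k)\}$ converges to $f_*$ whenever $\sum_{k=1}^{\infty} \varepsilon_k < \infty$.
(iii) Furthermore, suppose the optimal set $X_*$ is nonempty, and consider the following cases:
(a) $X_*$ is bounded,
(b) $\sum_{k=1}^{\infty} \lambda_k \varepsilon_k < \infty$ and $(d, H) \in \mathcal{F}(\overline{C})$.
Then, under either (a) or (b), the sequence $\{x^k\}$ is bounded with all its limit points in $X_*$.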
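Proof. (i) From (2.4), since $g^k \in \partial_{\varepsilon_k} f(x^k)$ we have
$\lambda_k (f(x^k) - f(x)) \le \langle x - x^k, \nabla_1 d(x^k, x^{k-1}) \rangle + \lambda_k \varepsilon_k$ $\forall x \in C$. (2.7)
Using (2.6) at the points $c = x$, $a = x^{k-1}$, $b = x^k$, the above inequality implies that
$\lambda_k (f(x^k) - f(x)) \le H(x, x^{k-1}) - H(x, x^k) + \lambda_k \varepsilon_k$ $\forall x \in C$. (2.8)
Summing over $k = 1, \ldots, n$ we obtain
$-\sigma_n f(x) + \sum_{k=1}^{n} \lambda_k f(x^k) \le H(x, x^0) - H(x, x^n) + \sum_{k=1}^{n} \lambda_k \varepsilon_k$. (2.9)
Now setting $x = x^{k-1}$ in (2.8), we obtain
$f(x^k) - f(x^{k-1}) \le \varepsilon_k$. (2.10)
Multiplying the latter inequality by $\sigma_{k-1}$ (with $\sigma_0 \equiv 0$) and summing over $k = 1, \ldots, n$, we obtain, after some algebra,
$\sigma_n f(x^n) - \sum_{k=1}^{n} \lambda_k f(x^k) \le \sum_{k=1}^{n} \sigma_{k-1} \varepsilon_k$.
Adding this inequality to (2.9) and recalling that $\lambda_k + \sigma_{k-1} = \sigma_k$, it follows that
$f(x^n) - f(x) \le \sigma_n^{-1} [H(x, x^0) - H(x, x^n)] + \sigma_n^{-1} \sum_{k=1}^{n} \sigma_k \varepsilon_k$ $\forall x \in C$, (2.11)
proving (i), since $H(\cdot, \cdot) \ge 0$.
(ii) If $\sigma_n \to +\infty$ and $\varepsilon_k \to 0$, then dividing (2.9) by $\sigma_n$ and invoking Lemma 2.2(i), we obtain from (2.9) that $\liminf_{n \to \infty} f(x^n) \le \inf\{ f(x) \mid x \in C \}$, which together with $f(x^n) \ge \inf\{ f(x) \mid x \in \overline{C} \}$ implies that $\liminf_{n \to \infty} f(x^n) = \inf\{ f(x) \mid x \in \overline{C} \} = f_*$.
From (2.10) we have
$0 \le f(x^k) - f_* \le f(x^{k-1}) - f_* + \varepsilon_k$.
Then using Lemma 2.1 it follows that the sequence $\{f(x^k)\}$ converges to $f_*$ whenever $\sum_{k=1}^{\infty} \varepsilon_k < \infty$.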
(iii) Case (a): If X
$\sum_{k=1}^{\infty} \lambda_k \varepsilon_k < \infty$ and that $(d, H) \in \mathcal{F}(\overline{C})$. Then (2.8) holds for each $x \in \overline{C}$, and in particular for $x \in X_*$, so that
$H(x, x^k) \le H(x, x^{k-1}) + \lambda_k \varepsilon_k$ $\forall x \in X_*$. (2.12)
Summing over $k = 1, \ldots, n$, we obtain
$H(x, x^n) \le H(x, x^0) + \sum_{k=1}^{n} \lambda_k \varepsilon_k$.
But, since in this case $H(x, \cdot)$ is level bounded, the last inequality implies that the sequence $\{x^k\}$ is bounded, and thus as in Case (a) it follows that all its limit points are in $X_*$.
An immediate byproduct of the above analysis yields the following global rate of convergence estimate for the exact version of IPA, i.e., with $\varepsilon_k = 0$ $\forall k$.
Corollary 2.1. Let $(d, H) \in \mathcal{F}(\overline{C})$, $X_* \ne \emptyset$, and $\{x^k\}$ be the sequence generated by IPA with $\varepsilon_k = 0$ $\forall k$. Then, $f(x^n) - f_* = O(\sigma_n^{-1})$.
Proof. Under the given hypothesis, Theorem 2.1(i) holds for any $x \in \overline{C}$, and it follows that $f(x^n) - f_* \le (\sigma_n)^{-1} H(x^*, x^0)$ for any $x^* \in X_*$.
To establish the global convergence of the sequence $\{x^k\}$ to an optimal solution of problem (P), we need to make further assumptions on the induced proximal distance $H$, mimicking the behavior of norms.
Let $(d, H) \in \mathcal{F}_+(\overline{C}) \subset \mathcal{F}(\overline{C})$ be such that the function $H$ satisfies the following two additional properties:
(a1) $\forall y \in \overline{C}$ and $\forall \{y^k\} \subset C$ bounded with $\lim_{k \to +\infty} H(y, y^k) = 0$, we have $\lim_{k \to +\infty} y^k = y$;
(a2) $\forall y \in \overline{C}$ and $\forall \{y^k\} \subset C$ converging to $y$, we have $\lim_{k \to +\infty} H(y, y^k) = 0$.
With these additional hypotheses on $H$ we immediately obtain that IPA globally converges to an optimal solution of (P).
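Theorem 2.2. Let $(d, H) \in \mathcal{F}_+(\overline{C})$ and let $\{x^k\}$ be the sequence generated by IPA. Suppose that the optimal set $X_*$ of (P) is nonempty, $\sigma_n = \sum_{k=1}^{n} \lambda_k \to \infty$, $\sum_{k=1}^{\infty} \lambda_k \varepsilon_k < \infty$, and $\sum_{k=1}^{\infty} \varepsilon_k < \infty$. Then the sequence $\{x^k\}$ converges to an optimal solution of (P).
Proof. Let $x \in X_*$. From (2.12), together with $\sum_{k=1}^{\infty} \lambda_k \varepsilon_k < +\infty$ and Lemma 2.1, we obtain that the sequence $\{H(x, x^k)\}$ converges to some $a(x) \in \mathbb{R}$ $\forall x \in X_*$. Let $x^\infty$ be the limit of a subsequence $\{x^{k_l}\}$; by Theorem 2.1(iii), $x^\infty \in X_*$. Then by assumption (a2) $\lim_{l \to \infty} H(x^\infty, x^{k_l}) = 0$, so that $\lim_{k \to \infty} H(x^\infty, x^k) = 0$, and by assumption (a1) it follows that the sequence $\{x^k\}$ converges to $x^\infty$.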
Note that we have separated the two types of convergence results to emphasize the differences and roles played by each of the three classes $\mathcal{F}_+(\overline{C}) \subset \mathcal{F}(\overline{C}) \subset \mathcal{F}(C)$, and that the largest, and least demanding, class $\mathcal{F}(C)$ already provides reasonable convergence properties for IPA, with minimal assumptions on the problem's data.
These aspects are illustrated by several application examples in the next section.
Relations (2.3), (2.4) defining IPA can sometimes be difficult to implement, since at each step we have to find by some algorithm in a finite number of steps an $\varepsilon_k$-solution for the minimization of the function $\lambda_k f(\cdot) + d(\cdot, x^{k-1})$. To overcome this difficulty, we consider here (among others) a variant of the approximate rule proposed in [20] for self-proximal Bregman methods.
Interior proximal algorithm with approximation rule (IPA1). Let
$(d, H) \in \mathcal{F}(C)$,
,
k
> 0, and
k
> 0
with
k=1
k
< ,
k=1
k
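$< \infty$. Starting from a point $x^0 \in C$, for all $k \ge 1$ we generate the sequences $\{x^k\}_{k=1}^{\infty} \subset C$, $\{e^k\}_{k=1}^{\infty} \subset \mathbb{R}^n$ via
$e^k = \lambda_k g^k + \nabla_1 d(x^k, x^{k-1})$ with $g^k \in \partial f(x^k)$, (2.13)
where the error sequence $\{e^k\}$ satisfies the conditions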
|e
k
|
k
, |e
k
| sup(|x
k
|, |x
k1
|)
k
. (2.14)
Remark 2.1. From Proposition 2.1, a sequence $\{x^k\}$ given by relations (2.13), (2.14) always exists. Furthermore, if $f$ is $C^1$ on $C$ ($C^2$ on $C$ with $d(\cdot, y) \in C^2$ on $C$ for all $y \in C$), then any convergent gradient-type method (Newton-type method) will provide such an $x^k$ in a finite number of steps.
Theorem 2.3. Let $(d, H) \in \mathcal{F}(C)$, and let $\{x^k\}$ be a sequence generated by IPA1. Then we have the following:
(i) The sequence $\{f(x^k)\}$ converges to $f_*$.
(ii) Furthermore, suppose that the optimal set $X_*$ is nonempty, and consider the following cases:
(a) $X_*$ is bounded;
(b) $(d, H) \in \mathcal{F}(\overline{C})$;
(c) $(d, H) \in \mathcal{F}_+(\overline{C})$.
Then under (a) or (b), the sequence $\{x^k\}$ is bounded with all limit points in $X_*$
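$\lambda_k (f(x^k) - f(x)) \le \langle x - x^k, \nabla_1 d(x^k, x^{k-1}) \rangle + \langle e^k, x^k - x \rangle$
$\le H(x, x^{k-1}) - H(x, x^k) + \delta_k(x)$, (2.15)
with $\delta_k(x) := \|e^k\| \|x\| + \langle x^k, e^k \rangle$. Summing (2.15) over $k = 1, \ldots, n$ and dividing by $\sigma_n = \sum_{k=1}^{n} \lambda_k$ we obtain
$-f(x) + \sigma_n^{-1} \sum_{k=1}^{n} \lambda_k f(x^k) \le \sigma_n^{-1} \left[ H(x, x^0) - H(x, x^n) + \sum_{k=1}^{n} \delta_k(x) \right]$. (2.16)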
Now setting x = x
k1
in (2.15) and
k
:= [
k
(x
k1
)[
1
, we obtain (f(x
k
)
f(x
k1
))
k
. But using (2.14), one has
k=1
k
(x) < and
k=1
k
< .
Therefore by passing to the limit in (2.16) and invoking Lemma 2.2(i) it follows that $\liminf_{n \to \infty} f(x^n) - f(x) \le 0$ for each $x \in C$ so that $\liminf_{n \to \infty} f(x^n) \le \inf\{ f(x) \mid x \in C \}$. From here, the proof can be completed with the same arguments as in the proofs of Theorems 2.1 and 2.2.
Theorem 2.3(c) recovers and extends [20, Theorem 1, p. 120] for the case of convex
minimization, which was proved there only for the Bregman self-proximal method.
3. Proximal distances (d, H): Examples. It turns out that in most situations, when constructing an IPA for solving the convex problem (P), the proximal distance $H$ induced by $d$ will be a Bregman proximal distance $D_h$ generated by some convex kernel $h$. In the first part of this section we recall the special features of the Bregman proximal distance. In the second part we consider various types of constraint sets $C$ for problem (P). We demonstrate through many examples for the pair $(d, H)$ that many well-known proximal methods, as well as new ones, can be handled through our framework.
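3.1. Bregman proximal distances. Let $h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper, lsc, and convex function with $\operatorname{dom} h \subset \overline{C}$ and $\operatorname{dom} \nabla h = C$, strictly convex and continuous on $\operatorname{dom} h$, $C^1$ on $\operatorname{int} \operatorname{dom} h = C$. Define
$H(x, y) := D_h(x, y) := h(x) - [h(y) + \langle \nabla h(y), x - y \rangle]$ $\forall x \in \mathbb{R}^n$, $\forall y \in \operatorname{dom} \nabla h$,
$H(x, y) := +\infty$ otherwise. (3.1)
The function $D_h$ enjoys a remarkable three point identity [16, Lemma 3.1],
$H(c, a) = H(c, b) + H(b, a) + \langle c - b, \nabla_1 H(b, a) \rangle$ $\forall a, b \in C$, $\forall c \in \operatorname{dom} h$. (3.2)
This identity plays a central role in the convergence analysis.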
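The three point identity (3.2) is easy to check numerically for a concrete kernel. The sketch below assumes the entropy kernel $h(x) = \sum_j x_j \log x_j$ on the positive orthant; the helper names and the random test points are illustrative choices, not part of the paper.

```python
import numpy as np

def h(x):
    """Entropy kernel h(x) = sum_j x_j log x_j, for x in the positive orthant."""
    return float(np.sum(x * np.log(x)))

def grad_h(x):
    """Gradient of the entropy kernel."""
    return np.log(x) + 1.0

def bregman(x, y):
    """D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>, as in (3.1)."""
    return h(x) - h(y) - float(np.dot(grad_h(y), x - y))

def grad1_bregman(b, a):
    """Gradient of D_h(., a) at b, which equals grad h(b) - grad h(a)."""
    return grad_h(b) - grad_h(a)

rng = np.random.default_rng(1)
a, b, c = rng.uniform(0.1, 2.0, size=(3, 4))   # three points with positive entries
lhs = bregman(c, a)
rhs = bregman(c, b) + bregman(b, a) + float(np.dot(c - b, grad1_bregman(b, a)))
```

Since $D_h(b, a) \ge 0$, dropping that term from `rhs` gives exactly inequality (2.6) with $d = H = D_h$.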
To handle the constraint cases $C$ versus $\overline{C}$, we consider two types of kernels $h$. The first type consists of convex kernel functions $h$ (often called a Bregman function with zone $C$; see, e.g., [15]) that satisfy the following conditions:
(B1) $\operatorname{dom} h = \overline{C}$;
(B2) (i) $\forall x \in \overline{C}$, $D_h(x, \cdot)$ is level bounded on $\operatorname{int}(\operatorname{dom} h)$;
(ii) $\forall y \in C$, $D_h(\cdot, y)$ is level bounded;
(B3) $\forall y \in \operatorname{dom} h$, $\forall \{y^k\} \subset \operatorname{int}(\operatorname{dom} h)$ with $\lim_{k \to \infty} y^k = y$, one has $\lim_{k \to \infty} D_h(y, y^k) = 0$;
(B4) if $\{y^k\}$ is a bounded sequence in $\operatorname{int}(\operatorname{dom} h)$ and $y \in \operatorname{dom} h$ such that $\lim_{k \to \infty} D_h(y, y^k) = 0$, then $y = \lim_{k \to \infty} y^k$.
Note that (B4) is a direct consequence of the first three properties, a fact proved by Kiwiel in [29, Lemma 2.16].
Let $\mathcal{B}$ be the class of kernels $h$ satisfying properties (B1)–(B4). More general Bregman proximal distances such as those introduced in [29] could also be candidates. For the sake of simplicity we consider here only the case $h \in \mathcal{B}$.
For the second type of kernels, we require the convex kernel $h$ to satisfy two (weaker)$^1$ conditions:
(WB1) $\operatorname{dom} h = C$;
(WB2) (i) $\forall x \in C$, $D_h(x, \cdot)$ is level bounded on $C$;
(ii) $\forall y \in C$, $D_h(\cdot, y)$ is level bounded.
We denote by $\mathcal{WB}$ the set of such convex kernels $h$.
We give here some examples that underline the difference between the classes $\mathcal{B}$ and $\mathcal{WB}$.
Example 3.1. Let $C = \mathbb{R}^n_{++}$. Separable Bregman proximal distances are the most commonly used in the literature. Let $\theta : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ be a proper convex and lsc function with $(0, +\infty) \subset \operatorname{dom} \theta \subset [0, +\infty)$ and such that $\theta \in C^2(0, +\infty)$, $\theta''(t) > 0$ $\forall t > 0$, and $\lim_{t \to 0^+} \theta'(t) = -\infty$. Let $h(x) = \sum_{j=1}^{n} \theta(x_j)$, so that $D_h$ is separable. The first two examples are functions $\theta_i \in \Theta_0$, i.e., with $\operatorname{dom} \theta = [0, +\infty)$, and the last two are in $\Theta_+$, i.e., with $\operatorname{dom} \theta = (0, +\infty)$:
$\theta_1(t) = t \log t$ (Shannon entropy),
$\theta_2(t) = (pt - t^p)/(1 - p)$ with $p \in (0, 1)$,
$\theta_3(t) = -\log t$ (Burg entropy),
$\theta_4(t) = t^{-1}$.
More examples can be found in, e.g., [29, 43]. Then, for the corresponding proximal distances $D_{h_i}$ one has $h_1, h_2 \in \mathcal{B}$, while $h_3, h_4 \in \mathcal{WB}$.
$^1$ The terminology "weaker" is used here to indicate that a weaker type of convergence result can be derived for this class. Indeed, with $h \in \mathcal{WB}$ one has $(d, D_h) \in \mathcal{F}(C)$ and only Theorem 2.1 (except (iii)(b)) can be applied.
3.2. Self-proximal methods. The three point identity (3.2) plays a fundamental role in the convergence of Bregman-based self-proximal methods, namely those for which we take $d$ itself as a Bregman proximal distance, that is, $d(x, y) = H(x, y) = D_h(x, y)$, with $D_h$ as defined in (3.1). Whenever $h \in \mathcal{B}$, or in $\mathcal{WB}$, properties (P1), (P2), and (P3) hold for $d = D_h$.
Clearly, $D_h(a, a) = 0$ $\forall a \in C$, so that (P4) holds, and since $H$ is always nonnegative it follows from (3.2) that (2.6) holds. Therefore for $h \in \mathcal{WB}$ one has $(d, H) = (D_h, D_h) \in \mathcal{F}(C)$, while if $h \in \mathcal{B}$, then $(d, H) = (D_h, D_h) \in \mathcal{F}_+(\overline{C})$.
When $C = \mathbb{R}^n$, with $h(\cdot) = \|\cdot\|^2/2 \in \mathcal{B}$, then $D_h(x, y) = \|x - y\|^2/2$, and with $(d, H) = (D_h, D_h) \in \mathcal{F}_+(\mathbb{R}^n)$, the IPA is exactly the classical proximal method and Theorems 2.1 and 2.2 cover the usual convergence results, e.g., [24, 30, 31].
We now list several interesting special cases for the pair $(d, H)$ leading to self-proximal schemes for various types of constraints.
Nonnegative constraints. Let $C = \mathbb{R}^n_{++}$ and $\overline{C} = \mathbb{R}^n_+$. For the examples given in Example 3.1, the resulting self-proximal algorithms, namely with $d = H = D_{h_i}$, yield $(d, D_{h_i}) \in \mathcal{F}_+(\overline{C})$ for $i = 1, 2$ and $(d, D_{h_i}) \in \mathcal{F}(C)$ for $i = 3, 4$.
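To make the self-proximal scheme over the nonnegative orthant concrete, here is a minimal sketch of the exact IPA with the Shannon entropy kernel, for which $d = H$ is the Kullback–Leibler distance $K(x, y) = \sum_j x_j \log(x_j / y_j) + y_j - x_j$. The linear objective $f(x) = \langle c, x \rangle$ with $c > 0$, the constant step size, and all function names are assumptions chosen so that the proximal step has a closed form; the last lines compare the optimality gap with the bound $H(x^*, x^0)/\sigma_n$ of Theorem 2.1(i) for $\varepsilon_k = 0$.

```python
import numpy as np

def kl(x, y):
    """K(x, y) = sum_j x_j log(x_j / y_j) + y_j - x_j, with the convention 0 log 0 = 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pos = x > 0
    return float(np.sum(x[pos] * np.log(x[pos] / y[pos])) + np.sum(y - x))

def entropic_ipa(c, x0, lams):
    """Exact IPA for f(x) = <c, x> with d = H = K (illustrative special case)."""
    x = np.asarray(x0, dtype=float).copy()
    iterates = [x.copy()]
    for lam in lams:
        # Optimality condition lam * c + log(x / x_prev) = 0 gives a multiplicative step.
        x = x * np.exp(-lam * c)
        iterates.append(x.copy())
    return iterates

c = np.array([1.0, 0.5, 2.0])
x0 = np.ones(3)
lams = np.full(50, 0.7)
xs = entropic_ipa(c, x0, lams)
sigma = np.cumsum(lams)
x_star = np.zeros(3)                              # minimizer of <c, x> over the orthant, f_* = 0
gaps = np.array([float(c @ x) for x in xs[1:]])   # f(x^n) - f_*
bounds = kl(x_star, x0) / sigma                   # H(x^*, x^0) / sigma_n
```

All iterates stay strictly positive without any projection, which is the constraint-elimination effect attributed to property (P2).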
Semidefinite constraints. We denote by $S^n$ the linear space of symmetric real matrices equipped with the trace inner product $\langle x, y \rangle := \operatorname{tr}(xy)$ and $\|x\| = \sqrt{\operatorname{tr}(x^2)}$ $\forall x, y \in S^n$, where $\operatorname{tr}(x)$ is the trace of the matrix $x$ and $\det x$ its determinant. The cone of $n \times n$ symmetric positive semidefinite (positive definite) matrices is denoted by $S^n_+$ ($S^n_{++}$). Let $C = S^n_{++}$ and $\overline{C} = S^n_+$. Let $h_1 : S^n_+ \to \mathbb{R}$, $h_1(x) = \operatorname{tr}(x \log x)$ and $h_3 : S^n_{++} \to \mathbb{R}$, $h_3(x) = -\operatorname{tr}(\log x) = -\log \det(x)$ (which correspond to the first and third kernels, respectively, of Example 3.1). For any $y \in S^n_{++}$, let
$d_1(x, y) = \operatorname{tr}(x \log x - x \log y + y - x)$ with $\operatorname{dom} d_1(\cdot, y) = S^n_+$,
$d_3(x, y) = \operatorname{tr}(-\log x + \log y + x y^{-1}) - n = -\log \det(x y^{-1}) + \operatorname{tr}(x y^{-1}) - n$ with $\operatorname{dom} d_3(\cdot, y) = S^n_{++}$.
The proximal distances $d_1$, $d_3$ are Bregman type corresponding to $h_1$, $h_3$, respectively, and were proposed by Doljansky and Teboulle in [18], who derived convergence results for the associated IPA. From the results of [18] it is easy to see that $d_i \in \mathcal{D}(C)$, $i = 1, 3$, and with $H(x, y) = d_i(x, y)$ it follows that $(d_1, H) \in \mathcal{F}(S^n_+)$ and $(d_3, H) \in \mathcal{F}(S^n_{++})$ so that we recover the convergence results of [18] through Theorem 2.1. However, as noticed in a counterexample [18, Example 4.1], property (B3) does not hold even for $d_1$, and therefore $(d_i, H) \notin \mathcal{F}_+(\overline{C})$, $i = 1, 3$. Consequently, Theorem 2.2 does not apply, i.e., global convergence to an optimal solution cannot be guaranteed. Similar results can be easily extended to the more general case with $C = \{ x \in \mathbb{R}^m \mid B(x) \in S^n_{++} \}$ assumed nonempty, with $B(x) = \sum_{i=1}^{m} x_i B_i - B_0$, where $B_i \in S^n$ $\forall i = 0, 1, \ldots, m$, and the map $x \mapsto \sum_{i=1}^{m} x_i B_i$ being onto, by considering the corresponding proximal distances,
$D_1(x, y) = d_1(B(x), B(y))$, $D_3(x, y) = d_3(B(x), B(y))$.
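For readers who want to experiment with the log-determinant distance $d_3(x, y) = -\log\det(x y^{-1}) + \operatorname{tr}(x y^{-1}) - n$, the following sketch evaluates it with NumPy. The helper `random_spd` is a hypothetical utility introduced only to produce positive definite test matrices.

```python
import numpy as np

def logdet_distance(x, y):
    """d_3(x, y) = -log det(x y^{-1}) + tr(x y^{-1}) - n for positive definite x, y."""
    n = x.shape[0]
    m = x @ np.linalg.inv(y)
    # slogdet returns (sign, log|det|); det(x y^{-1}) > 0 for positive definite inputs.
    _, logabsdet = np.linalg.slogdet(m)
    return float(-logabsdet + np.trace(m) - n)

def random_spd(n, rng):
    """Hypothetical helper: a random symmetric positive definite n x n matrix."""
    a = rng.normal(size=(n, n))
    return a @ a.T + n * np.eye(n)

rng = np.random.default_rng(2)
x, y = random_spd(4, rng), random_spd(4, rng)
value = logdet_distance(x, y)
```

In terms of the eigenvalues $\mu_i > 0$ of $x y^{-1}$ the value equals $\sum_i (\mu_i - 1 - \log \mu_i)$, so it is nonnegative and vanishes when $x = y$.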
Convex programming. Let $f_i : \mathbb{R}^n \to \mathbb{R}$ be concave and $C^1$ on $\mathbb{R}^n$ for each $i \in [1, m]$. We suppose that Slater's condition holds, i.e., there exists some point $x^0 \in \mathbb{R}^n$ such that $f_i(x^0) > 0$ $\forall i \in [1, m]$ and that the open convex set $C$ is described by
$C = \{ x \in \mathbb{R}^n \mid f_i(x) > 0\ \ \forall i = 1, \ldots, m \}$
so that by Slater's assumption $C \ne \emptyset$ and $\overline{C} = \{ x \in \mathbb{R}^n \mid f_i(x) \ge 0,\ i \in [1, m] \}$.
Consider the class $\Theta_+$ of functions defined in Example 3.1, and for each $\theta \in \Theta_+$ let
$h(x) = \sum_{i=1}^{m} \theta(f_i(x))$ if $x \in C$, $h(x) = +\infty$ otherwise. (3.3)
Obviously $h$ is a proper, lsc, and convex function. Now, consider the Bregman proximal distance associated with $h_\nu(x) := h(x) + \frac{\nu}{2}\|x\|^2$ with $\nu > 0$. Then, we take $d(x, y) = D_{h_\nu}(x, y)$, where $D_{h_\nu}$ is the Bregman distance associated with $h_\nu$. Thanks to the condition $\nu > 0$, it follows that $h_\nu \in \mathcal{WB}$ and $(d, D_{h_\nu}) \in \mathcal{F}(C)$. An important and interesting case is obtained by choosing the Burg function, $\theta_3(t) = -\log t$. In this case we obtain the following:
$d(x, y) = \sum_{i=1}^{m} \left[ -\log \frac{f_i(x)}{f_i(y)} + \frac{\langle \nabla f_i(y), x - y \rangle}{f_i(y)} \right] + \frac{\nu}{2}\|x - y\|^2$. (3.4)
Note that in this case the function $d(\cdot, y)$ enjoys other interesting properties: for example, when the functions $f_i$ are concave quadratic, then $d(\cdot, y)$ is self-concordant for each $y \in C$, a property which is very useful when minimizing the function with Newton-type methods [33]. When $\nu = 0$, i.e., with $d = D_h$, such a proximal distance has been recently introduced by Alvarez, Bolte, and Brahic [1], in the context of dynamical systems to study interior gradient flows, but it requires a nondegeneracy condition, $\forall x \in C$: $\operatorname{span}\{ \nabla f_i(x) \mid i = 1, \ldots, m \} = \mathbb{R}^n$, which is satisfied mostly in the polyhedral case. Here, in the context of proximal methods, the addition of the regularized term in $D_{h_\nu}$ precludes the use of such a condition.
Second order cone constraints. Let $C = L^n_{++} := \{ x \in \mathbb{R}^n \mid x_n > (x_1^2 + \cdots + x_{n-1}^2)^{1/2} \}$ be the interior of the Lorentz cone, with closure denoted by $L^n_+$. Let $J_n$ be a diagonal matrix with its first $(n - 1)$ entries being $-1$ and the last being $1$, and define $h : L^n_{++} \to \mathbb{R}$ by $h(x) = -\log(x^T J_n x)$. Then $h$ is proper, lsc, and convex on $\operatorname{dom} h = L^n_{++}$. Let $h_\nu(x) = h(x) + \nu\|x\|^2/2$. Then thanks to $\nu > 0$, one has $h_\nu \in \mathcal{WB}$, and the Bregman proximal distance associated with $h_\nu$ is given by
$D_{h_\nu}(x, y) = -\log \frac{x^T J_n x}{y^T J_n y} + \frac{2 x^T J_n y}{y^T J_n y} - 2 + \frac{\nu}{2}\|x - y\|^2$, (3.5)
and we have $(D_{h_\nu}, D_{h_\nu}) \in \mathcal{F}(L^n_{++})$. As in the convex and semidefinite programming cases, one can easily handle the more general case with a nonempty $C = \{ x \in \mathbb{R}^n \mid Ax - b \in L^m_{++} \}$, where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, by choosing $h(x) := -\log (Ax - b)^T J_m (Ax - b)$.
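The Lorentz-cone distance (3.5) can be coded directly. The sketch below is an illustration under stated assumptions: the regularization weight is passed as a parameter named `nu`, points outside the open cone are assigned the value $+\infty$, and the test vectors are arbitrary.

```python
import numpy as np

def lorentz_form(x, y):
    """x^T J_n y with J_n = diag(-1, ..., -1, 1)."""
    return float(x[-1] * y[-1] - np.dot(x[:-1], y[:-1]))

def soc_distance(x, y, nu=1.0):
    """Regularized Bregman distance (3.5); y is assumed to lie in the open Lorentz cone."""
    qx, qy = lorentz_form(x, x), lorentz_form(y, y)
    if qx <= 0 or x[-1] <= 0:
        return float("inf")          # x is outside the interior of the cone
    bregman_part = -np.log(qx / qy) + 2.0 * lorentz_form(x, y) / qy - 2.0
    return float(bregman_part + 0.5 * nu * np.dot(x - y, x - y))

x = np.array([0.3, -0.2, 1.0])
y = np.array([0.1, 0.4, 2.0])
value = soc_distance(x, y)
outside = soc_distance(np.array([2.0, 0.0, 1.0]), y)
```

The infinite value outside the cone mirrors the role of (P2): minimizing $\lambda_k \langle g, \cdot \rangle + d(\cdot, y)$ can only return interior points.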
We will now show that, interestingly, even for IPA which are not self-proximal, the induced proximal distance $H$ from the choice of $d$ for various types of constraints will still be a Bregman proximal distance $D_h$ with an appropriate convex kernel $h$ in the class $\mathcal{B}$ or $\mathcal{WB}$.
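3.3. Proximal functions based on $\varphi$-divergences.
$\varphi$-divergence kernels. Let $\varphi : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ be an lsc, convex, proper function such that $\operatorname{dom} \varphi \subset \mathbb{R}_+$ and $\operatorname{dom} \partial\varphi = \mathbb{R}_{++}$. We suppose in addition that $\varphi$ is $C^2$, strictly convex, and nonnegative on $\mathbb{R}_{++}$ with $\varphi(1) = \varphi'(1) = 0$. We denote by $\Phi$ the class of such kernels and by $\Phi_1$ the subclass of these kernels satisfying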
(1)
_
1
1
t
_
(t)
(1)
_
1
1
t
_
(t)
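$\varphi_1(t) = t \log t - t + 1$, $\operatorname{dom} \varphi = [0, +\infty)$,
$\varphi_2(t) = -\log t + t - 1$, $\operatorname{dom} \varphi = (0, +\infty)$,
$\varphi_3(t) = 2(\sqrt{t} - 1)^2$, $\operatorname{dom} \varphi = [0, +\infty)$.
Corresponding to the classes $\Phi_r$, with $r = 1, 2$, we define a $\varphi$-divergence proximal distance by
$d_\varphi(x, y) = \sum_{i=1}^{n} y_i^r \varphi\!\left( \frac{x_i}{y_i} \right)$.
For any $\varphi \in \Phi$, since $\operatorname{argmin}\{ \varphi(t) \mid t \in \mathbb{R} \} = \{1\}$, $\varphi$ is coercive and thus it follows that $d_\varphi \in \mathcal{D}(C)$, with $C = \mathbb{R}^n_{++}$.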
The use of $\varphi$-divergence proximal distances is particularly suitable for handling polyhedral constraints. Let $C = \{ x \in \mathbb{R}^n \mid Ax < b \}$, where $A$ is an $(m, n)$ matrix of full rank $m$ ($m \ge n$). Particularly important cases include $C = \mathbb{R}^n_{++}$ or $C = \{ x \in \mathbb{R}^n \mid a_i < x_i < b_i\ \ \forall i = 1, \ldots, n \}$, with $a_i, b_i \in \mathbb{R}$. For the sake of simplicity we consider here only the case where $C = \mathbb{R}^n_{++}$. Indeed, as is already noted in several works (e.g., [3, 4, 44]), since these proximal distances are separable they can thus be extended without difficulty to the polyhedral case by redefining $d$ in the form
$d(x, y) := \sum_{i=1}^{m} d_i(b_i - \langle a_i, x \rangle, b_i - \langle a_i, y \rangle)$,
where $a_i$ are the rows of the matrix $A$ and $d_i(u_i, v_i) = v_i^r \varphi(u_i v_i^{-1})$.
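The class $\Phi_1$. It turns out that the induced proximal distance $H$ associated with $d_\varphi$, $\varphi \in \Phi_1$, is the Bregman proximal distance generated by the kernel $h(x) = \sum_{j=1}^{n} x_j \log x_j$ (obtained from the Shannon entropy kernel of Example 3.1) and given by
$D_h(x, y) := K(x, y) = \sum_{j=1}^{n} x_j \log \frac{x_j}{y_j} + y_j - x_j$ $\forall x \in \mathbb{R}^n_+$, $\forall y \in \mathbb{R}^n_{++}$, (3.8)
which is the Kullback–Leibler relative entropy. The fact that $K$ plays a central role in the analysis of IPA based on $\Phi_1$ was already realized in [28] and later formalized in [44, Lemma 4.1(ii)], which shows that for any $\varphi \in \Phi_1$ one has
$\langle c - b, \nabla_1 d_\varphi(b, a) \rangle \le \varphi''(1) [K(c, a) - K(c, b)]$ $\forall a, b \in \mathbb{R}^n_{++}$, $\forall c \in \mathbb{R}^n_+$.
Therefore $(d_\varphi, \varphi''(1) K) \in \mathcal{F}_+(\overline{C})$ and all the convergence results of section 2 apply for the corresponding IPA. We note parenthetically that the induced proximal distance $K$ can also be obtained from the $\varphi$-divergence with the kernel $\varphi_1$. In fact this should not be surprising, since it can be verified that $d_\varphi = D_h$ if and only if $h(x) = \sum_{j=1}^{n} \varphi_1(x_j)$.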
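The relation between the $\varphi$-divergence built from $\varphi_1(t) = t \log t - t + 1$ and the Kullback–Leibler distance can be checked numerically. The sketch below is illustrative only (function names and random points are assumptions); it also evaluates both sides of the inequality of the form (2.6) with $H = \varphi_1''(1) K = K$.

```python
import numpy as np

def phi1(t):
    """phi_1(t) = t log t - t + 1 for t > 0."""
    return t * np.log(t) - t + 1.0

def d_phi(x, y, r=1):
    """phi-divergence d(x, y) = sum_i y_i^r * phi_1(x_i / y_i), for positive x and y."""
    return float(np.sum(y**r * phi1(x / y)))

def kl(x, y):
    """Kullback-Leibler distance K(x, y) for positive x and y."""
    return float(np.sum(x * np.log(x / y) + y - x))

def grad1_d_phi(b, a):
    """Gradient of d_phi(., a) at b for r = 1: phi_1'(b_i / a_i) = log(b_i / a_i)."""
    return np.log(b / a)

rng = np.random.default_rng(3)
a, b, c = rng.uniform(0.1, 3.0, size=(3, 6))
lhs = float(np.dot(c - b, grad1_d_phi(b, a)))
rhs = kl(c, a) - kl(c, b)            # here phi_1''(1) = 1
```

For this particular kernel the difference `rhs - lhs` equals $K(b, a)$, which is the three point identity (3.2) in disguise.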
Regularized class $\Phi_1$. Let $\varphi \in \Phi_1$ and define $d(x, y) = d_\varphi(x, y) + 2^{-1}\nu\|x - y\|^2$, with $\nu > 0$. This proximal distance was recently considered in [2] in the context of Lotka–Volterra dynamical systems with the choice $\varphi = \varphi_2$. As shown there, one can verify that with $H(x, y) = K(x, y) + \frac{\nu}{2}\|x - y\|^2$, one has $(d_{\varphi_2}, H) \in \mathcal{F}_+(\overline{C})$.
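The class $\Phi_2$: Second order homogeneous proximal distances [5, 10, 45]. Let $\varphi(t) = \mu p(t) + \frac{\nu}{2}(t - 1)^2$ with $\nu \ge \mu > 0$, $p \in \Phi_2$, and let the associated proximal distance be defined by
$d_\varphi(x, y) = \sum_{j=1}^{n} y_j^2 \varphi\!\left( \frac{x_j}{y_j} \right)$.
In particular, $p(t) = -\log t + t - 1$ gives the so-called logarithmic-quadratic proximal distance [5]. Obviously $d_\varphi \in \mathcal{D}(C)$, and from the key inequality [4, Lemma 3.4] one has
$\langle c - b, \nabla_1 d_\varphi(b, a) \rangle \le \eta (\|c - a\|^2 - \|c - b\|^2)$ $\forall a, b \in \mathbb{R}^n_{++}$, $\forall c \in \mathbb{R}^n_+$
with $\eta = 2^{-1}(\nu + \mu)$. Therefore with $H(x, y) = \eta\|x - y\|^2$ it follows that $(d_\varphi, H) \in \mathcal{F}_+(\overline{C})$.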
4. Interior gradient methods. When $C = \mathbb{R}^n$, Correa and Lemaréchal [17] and Robinson [38] have remarked that the PA can be viewed as an $\varepsilon$-subgradient descent method. This idea was recently extended by Auslender and Teboulle [7] for the logarithmic-quadratic proximal method which allows us to handle linear inequality constraints directly. Given the framework developed in section 2, we extend these results for more general constraints and with various classes of proximal distances.
We first give the main convergence result. We then present applications and examples which allow us to improve some known interior gradient-based methods as well as to derive new and simple convergent algorithms for conic optimization problems.
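4.1. A general convergence theorem. To solve problem (P) $\inf\{ f(x) \mid x \in \overline{C} \}$ we consider the following general projected subgradient-based algorithm (PSA).
Take $d \in \mathcal{D}(C)$. Let $\lambda_k > 0$, $\varepsilon_k \ge 0$, and $m \in (0, 1]$, and for $k \ge 1$ generate the sequence $\{x^k, g^k\}$ such that
$x^{k-1} \in C$, $g^{k-1} \in \partial_{\varepsilon_k} f(x^{k-1})$, (4.1)
$x^k \in \operatorname{argmin}\{ \lambda_k \langle g^{k-1}, x \rangle + d(x, x^{k-1}) \mid x \in \overline{C} \}$, (4.2)
$f(x^k) \le f(x^{k-1}) + m(\langle g^{k-1}, x^k - x^{k-1} \rangle - \varepsilon_k)$. (4.3)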
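As an illustration of steps (4.1)–(4.3), the sketch below specializes PSA to the nonnegative orthant with $\varepsilon_k = 0$ and $d$ taken to be the Kullback–Leibler distance, for which step (4.2) reduces to a multiplicative update. The halving of $\lambda_k$ until (4.3) holds, the quadratic test objective, and all names are assumptions of this sketch rather than a step-size rule taken from the paper.

```python
import numpy as np

def psa_entropy(f, grad_f, x0, lam0=1.0, m=0.5, iters=200):
    """Sketch of PSA with eps_k = 0 and d = Kullback-Leibler distance on the positive orthant."""
    x = np.asarray(x0, dtype=float).copy()
    values = [f(x)]
    for _ in range(iters):
        g = grad_f(x)                               # (4.1): exact gradient, eps_k = 0
        lam = lam0
        while True:
            # (4.2): minimizer of lam * <g, z> + K(z, x) over the orthant, in closed form.
            x_new = x * np.exp(-lam * g)
            # (4.3): sufficient decrease test with eps_k = 0.
            if f(x_new) <= f(x) + m * float(np.dot(g, x_new - x)):
                break
            lam *= 0.5                              # backtracking: an assumption of this sketch
        x = x_new
        values.append(f(x))
    return x, values

target = np.array([1.0, -0.5, 2.0])
f = lambda x: 0.5 * float(np.dot(x - target, x - target))
grad_f = lambda x: x - target
x, values = psa_entropy(f, grad_f, np.ones(3))
# The constrained minimizer over the nonnegative orthant is (1, 0, 2), with f_* = 0.125.
```

The iterates remain strictly positive and the objective values are nonincreasing, while the second coordinate approaches the boundary value $0$ only in the limit.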
Let us briefly recall why the sequence $\{x^k\}$, constructed by the exact IPA ($\varepsilon_k = 0$) in section 2 via (2.3) and (2.4), fits in PSA (see, e.g., [17, 7] for more details). Starting IPA with $x^0 \in C$, one has $x^k \in C$, and it can be verified that $g^k \in \partial f(x^k)$ is equivalent to saying that
$g^k \in \partial_{\varepsilon_k} f(x^{k-1})$ with $\varepsilon_k = f(x^{k-1}) - f(x^k) + \langle g^k, x^k - x^{k-1} \rangle \ge 0$.
Therefore (2.3) and (2.4) are nothing else but (4.1) and (4.2). Then, with $m = 1$ and with $\varepsilon_k$ as defined above, inequality (4.3) holds as an equality, showing that the sequence $\{x^k\}$ generated by IPA satisfies (4.1), (4.2), and (4.3).
Building on the material developed in section 2 it is now possible to establish convergence results of PSA for various instances of the triple $[C, d, H]$, extending recent convergence results given in [7, Theorem 4.1]. Before doing so, we first note that by using the same arguments as in the proof of Proposition 2.1, it is easily seen that the existence of $x^k \in C$ is guaranteed.
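Theorem 4.1. Let $\{x^k\}$ be a sequence generated by PSA with $(d, H) \in \mathcal{F}(C)$. Set $\sigma_n = \sum_{k=1}^{n} \lambda_k$ and $\rho_k = \langle g^{k-1}, x^{k-1} - x^k \rangle$. Then,
(i) $\sum_{k=1}^{\infty} \rho_k < \infty$, $\sum_{k=1}^{\infty} \varepsilon_k < \infty$, and $\rho_k \ge \lambda_k^{-1} H(x^{k-1}, x^k) \ge 0$ $\forall k \in \mathbb{N}$.
(ii) $\forall z \in C$, $f(x^n) - f(z) \le \sigma_n^{-1} [H(z, x^0) + \sum_{k=1}^{n} \lambda_k (\varepsilon_k + \rho_k)]$.
(iii) The sequence $\{f(x^k)\}$ is nonincreasing and converges to $f_*$ as $\sigma_n \to \infty$.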
(iv) Suppose that the optimal set X
is nonempty and
n
. Then the se-
quence x
k
is bounded with all its limit points in X
is bounded.
(b) (d, H) T(C) and
k=1
k
k
< + (which in particular is true if
k
is bounded above).
In addition, if $(d, H) \in \mathcal{F}_+(\overline{C})$, then $\{x^k\}$ converges to an optimal solution of (P).
Proof. (i) From the optimality conditions, (4.2) is equivalent to
$\lambda_k g^{k-1} + \nabla_1 d(x^k, x^{k-1}) = 0$.
Since $H(\cdot, \cdot) \ge 0$ and $H(a, a) = 0$, from (2.6) with $c = a = x^{k-1}$, $b = x^k$ we then obtain
$\lambda_k \rho_k = \langle \nabla_1 d(x^k, x^{k-1}), x^k - x^{k-1} \rangle \ge H(x^{k-1}, x^k) \ge 0$.
Furthermore, from (4.3) we obtain
$m(\varepsilon_k + \rho_k) \le f(x^{k-1}) - f(x^k)$, (4.4)
which also shows that $\{f(x^k)\}$ is nonincreasing. Summing over $k = 1, \ldots, n$ in the last inequality it follows that
$m \sum_{k=1}^{n} (\varepsilon_k + \rho_k) \le f(x^0) - f(x^n) \le f(x^0) - f_*$, (4.5)
proving (i). Now since $\sigma_n = \sum_{k=1}^{n} \lambda_k$, using $\sigma_k = \lambda_k + \sigma_{k-1}$ (with $\sigma_0 = 0$), multiplying (4.4) by $\sigma_{k-1}$, and summing over $k = 1, \ldots, n$, we obtain
$\sum_{k=1}^{n} [(\sigma_k - \lambda_k) f(x^k) - \sigma_{k-1} f(x^{k-1})] \le 0$,
which reduces to
$\sigma_n f(x^n) - \sum_{k=1}^{n} \lambda_k f(x^k) \le 0$. (4.6)
Now, since $g^{k-1} \in \partial_{\varepsilon_k} f(x^{k-1})$, then for any $z \in C$ one has
$f(z) - f(x^{k-1}) + \varepsilon_k \ge \langle g^{k-1}, z - x^{k-1} \rangle$
$= \langle g^{k-1}, z - x^k \rangle + \langle g^{k-1}, x^k - x^{k-1} \rangle$
$= -\lambda_k^{-1} \langle z - x^k, \nabla_1 d(x^k, x^{k-1}) \rangle - \rho_k$
$\ge \lambda_k^{-1} [H(z, x^k) - H(z, x^{k-1})] - \rho_k$,
where the last inequality uses (2.6) with $b = x^k$, $a = x^{k-1}$. Since $f(x^k) \le f(x^{k-1})$, it then follows that
$\lambda_k (f(x^k) - f(z)) \le H(z, x^{k-1}) - H(z, x^k) + \lambda_k (\varepsilon_k + \rho_k)$.
Summing the above inequality over $k = 1, \ldots, n$, we obtain
$-\sigma_n f(z) + \sum_{k=1}^{n} \lambda_k f(x^k) \le H(z, x^0) - H(z, x^n) + \sum_{k=1}^{n} \lambda_k (\varepsilon_k + \rho_k)$.
Adding this inequality to (4.6) and dividing by $\sigma_n$ one obtains
$f(x^n) - f(z) \le \frac{H(z, x^0)}{\sigma_n} + \frac{\sum_{k=1}^{n} \lambda_k (\varepsilon_k + \rho_k)}{\sigma_n}$ $\forall z \in C$.
This proves (ii). Suppose $\sigma_n \to \infty$. Since the sequences $\{\varepsilon_k\}$ and $\{\rho_k\}$ converge to 0, invoking Lemma 2.2 and passing to the limit we obtain
$\lim_{n \to \infty} f(x^n) = \limsup_{n \to \infty} f(x^n) \le \inf\{ f(x) \mid x \in C \} = f_*$,
proving (iii). The rest of the proof is exactly the same as in the proof of Theorems 2.1 and 2.2.
Using (ii) of Theorem 4.1 with (4.5), we obtain the following corollary.
Corollary 4.1. Let $(d, H) \in \mathcal{F}(\overline{C})$ and let $\{x^k\}$ be the sequence produced by
PSA. Suppose that X
,= and 0 <
= O(n
1
).
4.2. Conic optimization: Interior projected gradient methods with strongly convex proximal distance.
4.2.1. Preliminaries. We consider now the problem
(M) $\inf\{ f(x) \mid x \in \overline{C} \cap \mathcal{V} \}$,
where $\mathcal{V} = \{ x : Ax = b \}$, with $b \in \mathbb{R}^m$, $A \in \mathbb{R}^{m \times n}$, $n \ge m$, $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is convex and lsc, and we assume that $\exists x^0 \in \operatorname{dom} f \cap C : Ax^0 = b$.
When $\overline{C}$ is a convex cone, problem (M) is the standard conic optimization problem (see, e.g., [33]), while whenever $\mathcal{V} = \mathbb{R}^n$ it is just a pure conic optimization problem.
In the following subsection, we assume also that f is continuously dierentiable with
f Lipschitz on C 1 and Lipschitz constant L, i.e., L > 0 such that
|f(x) f(y)| L|x y| x, y C 1. (4.7)
We consider now (d, H) T(C) such that d satises the following properties:
(s1) > 0 : y C 1, d(, y) is -strongly convex over C 1, i.e.,
1
d(x
1
, y)
1
d(x
2
, y), x
1
x
2
) |x
1
x
2
|
2
x
1
, x
2
C 1, (4.8)
for some norm | | in R
n
.
(s2) y C 1, d(, y) is C
2
on C with Hessian function denoted by
2
1
d(, y).
Therefore with the same arguments as the ones given in the proof of Proposition 2.1, it follows that for each $x \in C \cap \mathcal{V}$, for each $v \in \mathbb{R}^n$ there exists a unique (by strong convexity) point $u(v, x) \in C \cap \mathcal{V}$ solving
$$ u(v, x) = \operatorname{argmin}\{\langle v, z\rangle + d(z, x) \mid z \in \mathcal{V}\}. \tag{4.9} $$
Then from the optimality conditions for the convex problem (4.9) (see, e.g., [39, section 28]), $\exists \eta := \eta(v, x) \in \mathbb{R}^m$ such that$^2$
$$ v + A^t \eta + \nabla_1 d(u(v, x), x) = 0, \quad Au(v, x) = b. \tag{4.10} $$
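Clearly, problem (M) can be equivalently formulated in the form of problem (P) as follows:
$$ f_* = \min\{ f_0(x) \mid x \in \overline{C} \} \quad \text{with } f_0 = f + \delta_{\mathcal{V}}. $$
Define $\mathcal{V}_0 = \{x : Ax = 0\}$. Note that for any $w \in \mathcal{V}$ one has $f(w) = f_0(w)$ and
$$ \theta(\eta, x) := (\nabla f(x) + A^t \eta) \in \partial f_0(x) \quad \forall x \in C \cap \mathcal{V},\ \forall \eta \in \mathbb{R}^m. \tag{4.11} $$
Indeed, for any $z, x \in \mathcal{V}$ we have $z - x \in \mathcal{V}_0$ and thus, for any $\eta \in \mathbb{R}^m$,
$$ f_0(z) = f(z) \ge f(x) + \langle \nabla f(x), z - x\rangle = f_0(x) + \langle \nabla f(x) + A^t \eta, z - x\rangle = f_0(x) + \langle \theta(\eta, x), z - x\rangle. $$
Since for $z \notin \mathcal{V}$ this inequality obviously holds, (4.11) is verified.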
4.2.2. Algorithms. We can now propose for solving problem (M) the basic iteration of our algorithm. Given a step-size rule for choosing $\lambda_k$ at each step $k$, starting from a point $x^0 \in C \cap \mathcal{V}$ we generate iteratively the sequence $\{x^k\} \subset C \cap \mathcal{V}$ by the relation
$$ x^k = u(\lambda_k \nabla f(x^{k-1}), x^{k-1}). \tag{4.12} $$
As a consequence of the above discussion, relations (4.1) and (4.2) are satisfied with $f$ replaced by $f_0$, $\varepsilon_k = 0$, and
$$ g^{k-1} = \theta\!\left( \frac{\eta(\lambda_k \nabla f(x^{k-1}), x^{k-1})}{\lambda_k},\ x^{k-1} \right) \in \partial f_0(x^{k-1}). \tag{4.13} $$
$^2$ Note that the first relation in the optimality condition can be rewritten equivalently as $v + \nabla_1 d(u(v, x), x) \in \mathcal{V}_0^\perp$, where $\mathcal{V}_0 = \{x : Ax = 0\}$.
We propose now two step-size rules, and for each rule we will show that inequality (4.3) holds and $\sum_{k=1}^{\infty} \lambda_k = \infty$. As a consequence we will be able to apply Theorem 4.1 and then to devise two convergent interior gradient projection algorithms which naturally extend the results of Auslender and Teboulle [7].
Algorithm 1 (constant step-size rule). Let ]0, 1[ and set
:= 2L
1
,
(0,
0
, stop. Otherwise, compute
x
k
= x
k
(
k
) := u(
k
f(x
k1
), x
k1
) with
k
(
]. (4.14)
Theorem 4.2. Let $\{x^k\}$ be the sequence produced by Algorithm 1. If at step $k$ one has $\nabla f(x^{k-1}) \in \mathcal{V}_0^\perp$, then $x^{k-1}$ is an optimal solution. Otherwise, the sequence $\{f(x^k)\}$ is nonincreasing and converges to $f_*$
is nonempty; then
(a) if X
;
(b) if $(d, H) \in \mathcal{F}_+(\overline{C})$, the sequence $\{x^k\}$ converges to an optimal solution of (P).
Proof. First, if $\nabla f(x^{k-1}) \in \mathcal{V}_0^\perp$, since $x^{k-1} \in C \cap \mathcal{V}$ then obviously, from the optimality conditions (4.10), it follows that $x^{k-1}$ is also an optimal solution. Suppose now that $\nabla f(x^{k-1}) \notin \mathcal{V}_0^\perp$
. Since
k
, then
n
=
n
k=1
k
. Thus, it remains
to show (4.3), and our result would follow as a direct consequence of Theorem 4.1.
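Since $\nabla f$ is Lipschitz, by the well-known descent lemma (see, e.g., [12, p. 667]) one has
$$ f(x^k) \le f(x^{k-1}) + \langle \nabla f(x^{k-1}), x^k - x^{k-1}\rangle + \frac{L}{2}\|x^k - x^{k-1}\|^2. \tag{4.15} $$
Now, we first remark that
$$ (x^k - x^{k-1}) \in \mathcal{V}_0. \tag{4.16} $$
Then using (4.8), with $x_1 = y = x^{k-1} \in C \cap \mathcal{V}$, $x_2 = u(v, x^{k-1}) \in C \cap \mathcal{V}$, and $v = \lambda_k \nabla f(x^{k-1})$; (4.10); and $g^{k-1}$ as defined in (4.13) (recalling that $\nabla_1 d(y, y) = 0$), it follows that
$$ \lambda_k \langle g^{k-1}, x^{k-1} - x^k\rangle = \lambda_k \langle \nabla f(x^{k-1}), x^{k-1} - x^k\rangle \ge \sigma\|x^k - x^{k-1}\|^2. $$
This combined with (4.15) yields
$$ f(x^k) \le f(x^{k-1}) + \langle x^k - x^{k-1}, g^{k-1}\rangle \left(1 - \frac{L\lambda_k}{2\sigma}\right), $$
so that with $f_0(x^k) = f(x^k)$, $f_0(x^{k-1}) = f(x^{k-1})$ we get
$$ f_0(x^k) \le f_0(x^{k-1}) + \langle x^k - x^{k-1}, g^{k-1}\rangle \left(1 - \frac{L\lambda_k}{2\sigma}\right). $$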
Then with
=
2
L
, we get f
0
(x
k
) f
0
(x
k1
) + x
k
x
k1
, g
k1
)(1 ), showing
that (4.3) holds with m = 1 .
The second algorithm extends the method proposed in [7] and allows us to use
a generalized step-size rule, reminiscent of the one used in the classical projected
gradient method as studied by Bertsekas [11].
Algorithm 2 (Armijo–Goldstein step-size rule). Let $\beta \in (0, 1)$, $m \in (0, 1)$, and $s > 0$ be fixed chosen scalars. Start from a point $x^0 \in C \cap \mathcal{V}$ and generate the sequence $\{x^k\} \subset C \cap \mathcal{V}$ as follows: if $\nabla f(x^{k-1}) \in \mathcal{V}_0^\perp$ stop. Otherwise, with $x^k(\lambda) = u(\lambda \nabla f(x^{k-1}), x^{k-1})$, set $\lambda_k = \beta^{j_k} s$, where $j_k$ is the first nonnegative integer $j$ such that
$$ f(x^k(\beta^j s)) - f(x^{k-1}) \le m\langle \nabla f(x^{k-1}), x^k(\beta^j s) - x^{k-1}\rangle. \tag{4.17} $$
Then set $x^k = x^k(\lambda_k)$.
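The backtracking rule of Algorithm 2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the map `u` is passed in as a callable, the helper `entropic_u` (the entropy-based map on the unit simplex) is one possible choice and not the only one, the exact stopping test on $\nabla f(x^{k-1})$ is replaced by a fixed iteration budget, and the cap `max_backtracks` is a hypothetical safeguard that is not part of the algorithm as stated.

```python
import math


def entropic_u(v, x):
    """Illustrative choice of u(v, x): entropy distance on the unit simplex."""
    shift = min(v)  # shifting v by a constant does not change the normalized result
    w = [xj * math.exp(-(vj - shift)) for vj, xj in zip(v, x)]
    total = sum(w)
    return [wj / total for wj in w]


def armijo_interior_gradient(f, grad, u, x0, beta=0.5, m=0.5, s=1.0,
                             iters=100, max_backtracks=60):
    """Sketch of Algorithm 2: x^k = u(lam_k * grad f(x^{k-1}), x^{k-1})."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        fx = f(x)
        lam = s
        trial = x
        for _ in range(max_backtracks):  # hypothetical cap on the number of trials
            trial = u([lam * gj for gj in g], x)
            slope = sum(gj * (tj - xj) for gj, tj, xj in zip(g, trial, x))
            if f(trial) - fx <= m * slope:  # sufficient decrease test (4.17)
                break
            lam *= beta  # try the next step size beta^j * s
        x = trial
    return x
```

A possible use is to minimize $f(x) = \frac12\|x - p\|^2$ over the unit simplex for some $p$ in its relative interior, starting from the uniform vector; the iterates stay strictly positive because `entropic_u` is multiplicative.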
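In order to show that this step-size rule is well defined, we need the following proposition.
Proposition 4.1. For any $x \in C \cap \mathcal{V}$, any $v \in \mathbb{R}^n$, and $\lambda > 0$, the unique solution $u(\lambda v, x)$ defined by (4.9) satisfies $u(0, x) = x$ and the following properties hold:
(i) $\sigma\|x - u(\lambda v, x)\|^2 \le \lambda\langle x - u(\lambda v, x), v\rangle$,
(ii) $\dfrac{\|u(\lambda v, x) - x\|}{\lambda} \le \sigma^{-1}\|v\|$,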
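(iii) $\lim_{\lambda \to 0^+} \dfrac{u(\lambda v, x) - x}{\lambda} = -\xi(v, x)$, where $\xi(v, x) \in \mathcal{V}_0$ is the unique solution of
$$ \langle Q(x)\xi, h\rangle = \langle v, h\rangle \quad \forall h \in \mathcal{V}_0 \tag{4.18} $$
with $Q(x) = \nabla_1^2 d(x, x)$,
(iv) $\langle \xi(v, x), v\rangle \ge \sigma\|\xi(v, x)\|^2$.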
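Proof. Fix any $x \in C \cap \mathcal{V}$. By (4.9), we have $u(0, x) = \operatorname{argmin}\{d(z, x) \mid z \in \mathcal{V}\}$, and thus by optimality conditions (4.10) with $\eta = 0$ it follows that $u(0, x) = x$. Furthermore, from (4.10) we have
$$ \langle \lambda v + \nabla_1 d(u(\lambda v, x), x), x - u(\lambda v, x)\rangle = 0, $$
from which the inequality in (i) follows immediately by using the strong convexity inequality (4.8) at $y = x_1 = x$, $x_2 = u(\lambda v, x)$, and (ii) follows from (i) and the Cauchy–Schwarz inequality.
(iii) Since $d(\cdot, y)$ is $\sigma$-strongly convex on $C \cap \mathcal{V}$ it follows from (4.8) that
$$ \langle Q(x)h, h\rangle \ge \sigma\|h\|^2 \quad \forall h \in \mathcal{V}_0. \tag{4.19} $$
As a consequence of the Lax–Milgram theorem (see, for example, [14, Corollary 5.8]), (4.18) admits exactly one solution $\xi(v, x)$. Note that $\nabla_1 d(x, x) = 0$. Then, since by (4.10) we have
$$ \lambda v + \nabla_1 d(u(\lambda v, x), x) + A^t \eta(\lambda v, x) = 0, $$
it follows that
$$ \forall h \in \mathcal{V}_0 : \quad \lambda^{-1}\langle \nabla_1 d(u(\lambda v, x), x) - \nabla_1 d(x, x), h\rangle = -\langle v, h\rangle. $$
Denote $s(\lambda) := \dfrac{u(\lambda v, x) - x}{\lambda}$. By (ii), $s(\lambda)$ remains bounded as $\lambda \to 0^+$, and any limit point $u$ of $s(\lambda)$ belongs to $\mathcal{V}_0$ and satisfies
$$ \langle \nabla_1^2 d(x, x)u, h\rangle = \langle Q(x)u, h\rangle = -\langle v, h\rangle \quad \forall h \in \mathcal{V}_0, $$
which is equivalent to (4.18). As a consequence $u = -\xi(v, x)$ and $\lim_{\lambda \to 0^+} s(\lambda)$ exists and is equal to $-\xi(v, x)$. To prove (iv), take $h = \xi(v, x) \in \mathcal{V}_0$ in the last equality and use (4.19).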
We can now prove the convergence of Algorithm 2.
Theorem 4.3. Let $\{x^k\}$ be the sequence generated by Algorithm 2. If at step $k$ one has $\nabla f(x^{k-1}) \in \mathcal{V}_0^\perp$, then $x^{k-1}$ is an optimal solution. Otherwise, the algorithm is well defined, i.e., there exists an integer $j_k$ such that $\lambda_k = \beta^{j_k} s$, and the sequence $\{\lambda_k\}$ is bounded below by $\underline{\lambda} = \min(2\beta\sigma L^{-1}(1 - m), s) > 0$. Furthermore, Theorem 4.2 holds for the sequence produced by Algorithm 2.
Proof. We have only to prove that the algorithm is well dened and that
k
0
, since x C 1, then obviously, from optimality conditions (4.10), it
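follows that $x$ is also an optimal solution. Suppose now that $v \notin \mathcal{V}_0^\perp$ and that (4.17) does not hold. That is,
$$ f(x(\beta^j s)) - f(x) > m\langle x(\beta^j s) - x, v\rangle \quad \forall j \in \mathbb{N}. \tag{4.20} $$
Invoking the mean value theorem, $\exists z_j \in\, ]x, x(\beta^j s)[$ such that
$$ \left\langle \nabla f(z_j), \frac{x(\beta^j s) - x}{\beta^j s}\right\rangle > m\left\langle \frac{x(\beta^j s) - x}{\beta^j s}, v\right\rangle \quad \forall j \in \mathbb{N}. $$
But by Proposition 4.1(i) it follows that $\lim_{j \to \infty} z_j = x$. Moreover, passing to the limit in the last inequality and using (iii) and (iv) of the same proposition, we obtain
$$ (1 - m)\sigma\|\xi(v, x)\|^2 \le (1 - m)\langle v, \xi(v, x)\rangle \le 0, $$
which implies that $\xi(v, x) = 0$, and hence by (4.18) it follows that $v \in \mathcal{V}_0^\perp$, and we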
have reached a contradiction. Now let us prove that
k
1
does not satisfy (4.17), then
1
> 2L
1
(1 m), it follows that
k
n
j=1
x
r
j
(x
1
j
z
j
), > 0 for (z, x) C C.
Take for example $\varphi(t) = -\log t + t - 1$ and $r = 2$, namely the log-quad function. Then, (4.21) can be written as
$$ d(z, x) = \sum_{j=1}^n x_j^2\, \phi(x_j^{-1} z_j) \quad \text{with } \phi(t) = \frac{\sigma}{2}(t - 1)^2 + \mu(t - \log t - 1). $$
Solving (4.9), one easily obtains (see also [7, eq. (2.3), p. 4]) the following explicit formulas:
$$ \forall j = 1, \ldots, n, \quad u_j(v, x) = x_j (\phi^*)'(-v_j x_j^{-1}) $$
with $(\phi^*)'(s) = (2\sigma)^{-1}\left\{(\sigma - \mu) + s + \sqrt{((\sigma - \mu) + s)^2 + 4\mu\sigma}\right\}$.
In the case $r = 1$, (4.9) reduces to solving the equation in $z \equiv u(v, x) > 0$ given by
$$ v_j + \mu(1 - x_j z_j^{-1}) + \sigma(z_j - x_j) = 0, \quad j = 1, \ldots, n. $$
A simple calculation then yields the unique positive solution of this quadratic equation:
$$ u_j(v, x) = (2\sigma)^{-1}\left\{\sigma x_j - \mu - v_j + \sqrt{(\sigma x_j - \mu - v_j)^2 + 4\mu\sigma x_j}\right\}, \quad j = 1, \ldots, n. $$
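The two closed forms for the positive orthant can be checked numerically. The sketch below is a minimal illustration, assuming the quadratic weight is called `sigma` and the logarithmic weight `mu` (these parameter names, and the function names, are choices made here for the example); each function returns the positive root of the corresponding scalar optimality equation, coordinate by coordinate.

```python
import math


def logquad_u_r2(v, x, sigma=2.0, mu=1.0):
    """Case r = 2: u_j = x_j * t_j, where t_j > 0 solves
    sigma*(t - 1) + mu*(1 - 1/t) = -v_j / x_j."""
    out = []
    for vj, xj in zip(v, x):
        a = (sigma - mu) - vj / xj
        t = (a + math.sqrt(a * a + 4.0 * mu * sigma)) / (2.0 * sigma)
        out.append(xj * t)
    return out


def logquad_u_r1(v, x, sigma=2.0, mu=1.0):
    """Case r = 1: positive root z_j of
    v_j + mu*(1 - x_j/z_j) + sigma*(z_j - x_j) = 0."""
    out = []
    for vj, xj in zip(v, x):
        b = sigma * xj - mu - vj
        out.append((b + math.sqrt(b * b + 4.0 * mu * sigma * xj)) / (2.0 * sigma))
    return out
```

In both cases the square root strictly dominates the absolute value of the linear term, so the returned point is strictly positive for any $v$, which is what keeps the iterates in the interior without an explicit projection.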
B. Semidefinite programming, $C = S^n_{++}$. Take (as in section 3.2)
$$ p(x, y) = \operatorname{tr}(-\log x + \log y + x y^{-1}) - n \quad \forall x, y \in S^n_{++}, \qquad = +\infty \ \text{otherwise}, $$
which is obtained from the Bregman kernel $h : S^n_{++} \to \mathbb{R}$ defined by $h(x) = -\ln\det(x)$. Using the fact that $\nabla h(x) = -x^{-1}$, the optimality conditions for (4.9) allow us to solve for $z \equiv u(v, x)$ the matrix equation
$$ \sigma z - z^{-1} = \omega \quad \text{with } \omega := \sigma x - v - x^{-1}. $$
A direct calculation shows that the matrix
$$ u(v, x) = (2\sigma)^{-1}\left(\omega + \sqrt{\omega^2 + 4\sigma I}\right) \quad \forall x \in S^n_{++},\ v \in S^n $$
(where $I$ denotes the $n \times n$ identity matrix) is the unique solution of this equation, with $u(v, x) \in S^n_{++}$, since its eigenvalues are positive.
C. Second order cone programming, $C = L^n_{++}$. As in section 3, we take $h_\nu(x) = -\log(x^T J_n x) + \frac{\nu}{2}\|x\|^2$, with $J_n$ a diagonal matrix with its first $(n - 1)$ entries being $-1$ and the last entry being $1$. Consider the associated Bregman distance $D \equiv D_{h_\nu}$ (as given by (3.5), with $\nu \equiv 2\sigma > 0$, the multiplication by 2 being just for computational convenience):
$$ D(x, y) = -\log\frac{x^T J_n x}{y^T J_n y} + \frac{2x^T J_n y}{y^T J_n y} - 2 + \sigma\|x - y\|^2 \quad \forall x, y \in L^n_{++}. $$
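Moreover, we use the following notation. For any $\xi \in \mathbb{R}^n$, we set $\tau(\xi) := \xi^T J_n \xi$ and we write $\xi := (\bar{\xi}, \xi_n) \in \mathbb{R}^{n-1} \times \mathbb{R}$. Writing the optimality conditions for (4.9), we have to find the unique solution $u(v, x) \equiv z \in L^n_{++}$ (namely with $\tau(z) > 0$) solving
$$ v + \nabla h(z) - \nabla h(x) + 2\sigma(z - x) = 0. \tag{4.22} $$
Using $\nabla h(z) = -2\tau(z)^{-1} J_n z$ and defining $w := (\nabla h(x) + 2\sigma x - v)/2 := (\bar{w}, w_n) \in \mathbb{R}^{n-1} \times \mathbb{R}$, (4.22) reduces to
$$ \sigma z - \tau(z)^{-1} J_n z = w. \tag{4.23} $$
Decomposing (4.23) in the product space $\mathbb{R}^{n-1} \times \mathbb{R}$ yields
$$ \sigma\bar{z} + \tau(z)^{-1}\bar{z} = \bar{w}, \qquad \sigma z_n - \tau(z)^{-1} z_n = w_n, \tag{4.24} $$
and by eliminating $\tau(z) > 0$ from these last two equations we obtain
$$ (2\sigma z_n - w_n)\bar{z} = z_n\bar{w} \iff (2\sigma\bar{z} - \bar{w}) z_n = w_n\bar{z}. \tag{4.25} $$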
Now, multiplying (4.23) by $z$, we obtain $\sigma\|z\|^2 - w^T z - 1 = 0$, which after completing the square can be rewritten as $\|2\sigma\bar{z} - \bar{w}\|^2 + (2\sigma z_n - w_n)^2 = \|w\|^2 + 4\sigma$. Using (4.25) and defining $\zeta := 2\sigma z_n - w_n$, the last equation reads
$$ \frac{w_n^2\|\bar{w}\|^2}{\zeta^2} + \zeta^2 = \|w\|^2 + 4\sigma. \tag{4.26} $$
Now, it is easy to verify that $\zeta > 0$. Indeed, since $z \in L^n_{++}$, then $z_n > 0$, and by (4.24) one also has $w_n < \sigma z_n$, and it follows that $\zeta = 2\sigma z_n - w_n > \sigma z_n - w_n > 0$. Out of the two remaining solutions of (4.26), a direct computation (using the fact that $(\|w\|^2 + 4\sigma)^2 - 4w_n^2\|\bar{w}\|^2 = (w_n^2 - \|\bar{w}\|^2 + 4\sigma)^2 + 16\sigma\|\bar{w}\|^2$) shows that the unique positive solution of (4.26) that will warrant $\tau(z) > 0$ is given by the following:
$$ \zeta = \left(\frac{\|w\|^2 + 4\sigma + \sqrt{(\|w\|^2 + 4\sigma)^2 - 4w_n^2\|\bar{w}\|^2}}{2}\right)^{1/2}. $$
Therefore using (4.25) it follows that the unique solution $u \equiv z \in L^n_{++}$ of (4.22) is given by $z = (\bar{z}, z_n)$ with
$$ \bar{z} = \frac{z_n}{\zeta}\bar{w} = \frac{1}{2\sigma}\left(1 + \frac{w_n}{\zeta}\right)\bar{w}, \qquad z_n = \frac{1}{2\sigma}(w_n + \zeta). \tag{4.27} $$
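The closed form (4.27) lends itself to a direct numerical check. The following sketch assumes that the quadratic weight in $D$ is a scalar called `sigma` here and that vectors are plain Python lists whose last entry is the cone axis; the function names are illustrative only. It evaluates $w$, the scalar $\zeta$, and then $z$, so that the residual of the optimality condition (4.22) can be tested afterwards.

```python
import math


def tau(x):
    """tau(x) = x^T J_n x = x_n^2 - ||x_bar||^2 (positive inside the cone)."""
    return x[-1] ** 2 - sum(t * t for t in x[:-1])


def grad_h(x):
    """Gradient of h(x) = -log(x^T J_n x), i.e. -2 J_n x / tau(x)."""
    t = tau(x)
    return [2.0 * xi / t for xi in x[:-1]] + [-2.0 * x[-1] / t]


def soc_u(v, x, sigma=1.0):
    """Closed-form point z solving v + grad_h(z) - grad_h(x) + 2*sigma*(z - x) = 0."""
    gh = grad_h(x)
    w = [(g + 2.0 * sigma * xi - vi) / 2.0 for g, xi, vi in zip(gh, x, v)]
    wn = w[-1]
    wbar_sq = sum(t * t for t in w[:-1])
    a = wbar_sq + wn * wn + 4.0 * sigma
    # larger root of zeta^4 - a*zeta^2 + wn^2*||w_bar||^2 = 0
    zeta = math.sqrt((a + math.sqrt(a * a - 4.0 * wn * wn * wbar_sq)) / 2.0)
    scale = 0.5 * (1.0 + wn / zeta) / sigma
    return [scale * t for t in w[:-1]] + [(wn + zeta) / (2.0 * sigma)]
```

Only square roots and a handful of inner products are needed, so the cost per iteration is linear in $n$.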
Remark 4.1. It is worthwhile to mention that an alternative derivation of (4.27)
could also have been obtained by using properties and facts on Jordan algebra asso-
ciated with the second order cone; see, e.g., [22, 23].
D. Convex minimization over the unit simplex. An interesting special case of a conic optimization, with $\mathcal{V} \ne \mathbb{R}^n$, where $u(v, x)$ can be explicitly given, and where all this theory applies, is when $\overline{C} = \mathbb{R}^n_+$ and $A = e^T$, $b = 1$, i.e., $\mathcal{V} = \{x \in \mathbb{R}^n \mid \sum_{j=1}^n x_j = 1\}$, so that problem (M) reduces to a convex minimization problem over the unit simplex $\Delta = \{x \in \mathbb{R}^n \mid \sum_{j=1}^n x_j = 1,\ x \ge 0\}$. This
problem arises in important applications. In [9], Ben-Tal, Margalit, and Nemirovski
demonstrated that an algorithm based on the mirror descent (MDA) can be success-
fully used to solve very large-scale instances of computerized tomography problems,
modeled through (M). Recently, Beck and Teboulle [8] have shown that the MDA can be viewed as a projection subgradient algorithm with strongly convex Bregman proximal distances. As a result, to handle the simplex constraints $\Delta$, they proposed to use a Bregman proximal distance based on the entropy kernel
$$ \psi(x) = \begin{cases} \sum_{j=1}^n x_j \log x_j & \text{if } x \in \Delta, \\ +\infty & \text{otherwise} \end{cases} \tag{4.28} $$
to produce an entropic mirror descent algorithm (EMDA). It was shown in [8] that the EMDA preserved the same computational efficiency as the MDA (grows slowly with the dimension of the problem), but has the advantage of being given explicitly by a simple formula, since the problem
$$ u(v, x) = \operatorname{argmin}_{z \in \Delta}\{\langle v, z\rangle + D_\psi(z, x)\} \tag{4.29} $$
can be easily solved analytically and yields
$$ u_j(v, x) = \frac{x_j \exp(-v_j)}{\sum_{i=1}^n x_i \exp(-v_i)}, \quad j = 1, \ldots, n. \tag{4.30} $$
The resulting EMDA of [8] was then defined as follows: for each $j = 1, \ldots, n$ with $v_j = \frac{\partial f}{\partial x_j}(x^{k-1})$,
$$ x_j^k(\lambda_k) = u_j(\lambda_k v, x^{k-1}) = \frac{x_j^{k-1}\exp\left(-\lambda_k \frac{\partial f}{\partial x_j}(x^{k-1})\right)}{\sum_{i=1}^n x_i^{k-1}\exp\left(-\lambda_k \frac{\partial f}{\partial x_i}(x^{k-1})\right)}, \tag{4.31} $$
$$ \lambda_k = \frac{\sqrt{2\log n}}{L_f\sqrt{k}}, \tag{4.32} $$
where the objective function was supposed to be Lipschitz on $\Delta$ and $L_f$ is the Lipschitz constant.
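The EMDA update (4.31) with the step size (4.32) is short enough to write out. The sketch below is a plain-Python illustration, assuming the gradient is supplied as a callable `grad` and that `L_f` is a Lipschitz constant of $f$ with respect to $\|\cdot\|_1$; the shift by the smallest gradient entry is a numerical safeguard added here and does not change the normalized iterate.

```python
import math


def emda(grad, x0, L_f, iters):
    """Entropic mirror descent on the unit simplex, following (4.31)-(4.32)."""
    n = len(x0)
    x = list(x0)
    for k in range(1, iters + 1):
        lam = math.sqrt(2.0 * math.log(n)) / (L_f * math.sqrt(k))  # step size (4.32)
        g = grad(x)
        shift = min(g)  # common factor cancels in the ratio; avoids overflow in exp
        w = [xj * math.exp(-lam * (gj - shift)) for xj, gj in zip(x, g)]
        total = sum(w)
        x = [wj / total for wj in w]  # multiplicative update (4.31)
    return x
```

For a linear objective $f(x) = \langle c, x\rangle$ one can take $L_f = \max_j |c_j|$, and the iterates concentrate on the coordinate with the smallest cost.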
We can modify the EMDA with an Armijo–Goldstein step-size rule. Such a version of the EMDA can be more practical, since we do not need to know/compute the constant $L_f$. Indeed, it is well known (see, e.g., [8]) that
$$ \langle \nabla\psi(x) - \nabla\psi(y), x - y\rangle \ge \|x - y\|_1^2 \quad \forall x, y \in \Delta_+ = \Big\{x \in \mathbb{R}^n \;\Big|\; \sum_{j=1}^n x_j = 1,\ x > 0\Big\}, $$
namely $\psi$ is 1-strongly convex with respect to the norm $\|\cdot\|_1$
, and hence so is d = H =
D
$\gamma_k := \prod_{l=0}^{k-1}(1 - \alpha_l)$. (5.3)
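Thus, if at step $k$ we have a sequence $\{x^k\} \subset C \cap \mathcal{V}$ such that $f(x^k) \le \inf_{z \in C \cap \mathcal{V}} q_k(z) := q_k^*$, assuming that the optimal solution set $X_*$ is nonempty and taking $x^* \in X_*$, we obtain
$$ f(x^k) - f(x^*) \le \gamma_k(q_0(x^*) - f(x^*)). \tag{5.4} $$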
From the latter inequality it follows that if $\gamma_k \to 0$, then the sequence $\{x^k\}$ is a minimizing sequence for $f$ and the convergence rate of $f(x^k)$ to $f(x^*)$ is measured by the magnitude of $\gamma_k$. Therefore to construct algorithms based on the above scheme, which was proposed in [34], we need
• to generate an appropriate sequence of functions $\{q_k(\cdot)\}$,
• to ensure that at each iteration $k$ one has
$$ f(x^k) \le \min_{z \in C \cap \mathcal{V}} q_k(z) := q_k^*. $$
We begin by constructing the sequence of functions $\{q_k(\cdot)\}$. For that purpose, we take here $d \equiv H \in \mathcal{F}(C)$, where $H$ is a Bregman proximal distance (cf. (3.1)) with kernel $h$ such that
(h1) $\operatorname{dom} h = C$,
(h2) $h$ is $\sigma$-strongly convex on $C \cap \mathcal{V}$.
For every $k \ge 0$ and for any $x \in C \cap \mathcal{V}$, we construct the sequence $\{q_k(x)\}$ recursively via
$$ q_0(x) = f(x^0) + cH(x, x^0), \tag{5.5} $$
$$ q_{k+1}(x) = (1 - \alpha_k)q_k(x) + \alpha_k l_k(x, y^k), \tag{5.6} $$
$$ l_k(x, y^k) = f(y^k) + \langle x - y^k, \nabla f(y^k)\rangle. \tag{5.7} $$
Here, $c > 0$ and $\alpha_k \in [0, 1)$. The point $x^0$ is chosen such that $x^0 \in C \cap \mathcal{V}$, while the point $y^k \in C$ is arbitrary and will be generated in a specific way later. We first show that the sequence of functions $\{q_k(\cdot)\}$ satisfies (5.1).
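Lemma 5.1. The sequence $\{q_k(x)\}$ defined by (5.5)–(5.7) satisfies
$$ q_{k+1}(x) - f(x) \le (1 - \alpha_k)(q_k(x) - f(x)) \quad \forall x \in C \cap \mathcal{V}. $$
Proof. Since $f$ is convex, we have $f(x) \ge l_k(x, y^k)$ $\forall x \in C \cap \mathcal{V}$, and together with (5.6) we thus obtain
$$ q_{k+1}(x) \le (1 - \alpha_k)q_k(x) + \alpha_k f(x) \quad \forall x \in C \cap \mathcal{V}, $$
from which the desired result follows.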
Using the notation of section 4, we recall that for each $z \in C \cap \mathcal{V}$, for each $v \in \mathbb{R}^n$ there exists a unique (by strong convexity of $H(\cdot, z)$) point $u(v, z) \in C \cap \mathcal{V}$ solving
$$ u(v, z) = \operatorname{argmin}\{\langle v, x\rangle + H(x, z) \mid x \in C \cap \mathcal{V}\}. \tag{5.8} $$
The next result is crucial and shows that the sequence $\{q_k(\cdot)\}$ admits a simple generic form.
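Lemma 5.2. For any $k \ge 0$, one has
$$ q_k(x) = q_k^* + c_kH(x, z^k) \quad \forall x \in C \cap \mathcal{V} \tag{5.9} $$
with
$$ z^k = \operatorname{argmin}_{x \in C \cap \mathcal{V}} q_k(x), \quad q_k^* = q_k(z^k), \quad c_0 = c, \quad z^0 = x^0 \in C \cap \mathcal{V}. \tag{5.10} $$
Furthermore, the sequence $\{z^k\} \subset C \cap \mathcal{V}$ is uniquely defined by
$$ z^{k+1} = \operatorname{argmin}\left\{\left\langle x, \frac{\alpha_k}{c_{k+1}}\nabla f(y^k)\right\rangle + H(x, z^k) \;\middle|\; x \in C \cap \mathcal{V}\right\} \equiv u\!\left(\frac{\alpha_k}{c_{k+1}}\nabla f(y^k), z^k\right), \tag{5.11} $$
where the positive sequence $\{c_k\}$ satisfies $c_{k+1} = (1 - \alpha_k)c_k$.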
Proof. The proof is by induction and will use key identity (3.2). For $k = 0$, since $z^0 = x^0$ by (5.5), one has $q_0(x) = f(x^0) + cH(x, z^0)$. Then since $c\nabla_1 H(z^0, z^0) = 0$ (recall the properties of $H$), and since $z^0 \in C \cap \mathcal{V}$, the optimality conditions imply that $z^0 = \operatorname{argmin}_{x \in C \cap \mathcal{V}} q_0(x)$. Now suppose that (5.9) holds for some $k$ and let us prove that for any $x \in C \cap \mathcal{V}$,
$$ q_{k+1}(x) = q_{k+1}^* + c_{k+1}H(x, z^{k+1}). \tag{5.12} $$
Substituting (5.9) into (5.6) and using $c_{k+1} = (1 - \alpha_k)c_k$, one obtains
$$ q_{k+1}(x) = (1 - \alpha_k)q_k^* + c_{k+1}H(x, z^k) + \alpha_k l_k(x, y^k). \tag{5.13} $$
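Then by definition of $z^{k+1}$ we have
$$ z^{k+1} = \operatorname{argmin}_{x \in C \cap \mathcal{V}} q_{k+1}(x) = u\!\left(\frac{\alpha_k}{c_{k+1}}\nabla f(y^k), z^k\right) $$
with $z^{k+1} \in C \cap \mathcal{V}$, and
$$ q_{k+1}^* = q_{k+1}(z^{k+1}) = (1 - \alpha_k)q_k^* + c_{k+1}H(z^{k+1}, z^k) + \alpha_k l_k(z^{k+1}, y^k). \tag{5.14} $$
Subtracting (5.14) from (5.13), one obtains, using (5.7),
$$ \begin{aligned} q_{k+1}(x) &= q_{k+1}^* + c_{k+1}[H(x, z^k) - H(z^{k+1}, z^k)] + \alpha_k[l_k(x, y^k) - l_k(z^{k+1}, y^k)] \\ &= q_{k+1}^* + c_{k+1}[H(x, z^k) - H(z^{k+1}, z^k)] - \alpha_k\langle z^{k+1} - x, \nabla f(y^k)\rangle. \end{aligned} \tag{5.15} $$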
Now, since $z^{k+1} = \operatorname{argmin}_{x \in C \cap \mathcal{V}} q_{k+1}(x)$, then writing the optimality conditions for (5.13) (recalling the properties of $H$) yields
$$ c_{k+1}\langle \nabla_1 H(z^{k+1}, z^k), z^{k+1} - x\rangle = -\alpha_k\langle \nabla f(y^k), z^{k+1} - x\rangle \quad \forall x \in C \cap \mathcal{V}. \tag{5.16} $$
Using (5.16) in (5.15), it follows that for any $x \in C \cap \mathcal{V}$,
$$ q_{k+1}(x) = q_{k+1}^* + c_{k+1}[H(x, z^k) - H(z^{k+1}, z^k) + \langle z^{k+1} - x, \nabla_1 H(z^{k+1}, z^k)\rangle]. \tag{5.17} $$
Invoking the identity (3.2) at $c = x$, $b = z^{k+1}$, and $a = z^k$, the right-hand side of (5.17) reduces to $q_{k+1}(x) = q_{k+1}^* + c_{k+1}H(x, z^{k+1})$, and the lemma is proved.
The next result is fundamental to determining the main steps of the algorithm, namely the formulas needed to update the sequence $\{x^k\}$ and to determine the choice of the intermediary point $y^k$.
Theorem 5.1. Let $\sigma > 0$, $L > 0$ be given. Suppose that for some $k \ge 0$ we have a point $x^k \in C \cap \mathcal{V}$ such that $f(x^k) \le q_k^* = \min\{q_k(x) : x \in C \cap \mathcal{V}\}$. Let $\alpha_k \in [0, 1)$, $c_{k+1} = (1 - \alpha_k)c_k$, and $C \cap \mathcal{V} \ni z^k$ be given by (5.11). Define
$$ y^k = (1 - \alpha_k)x^k + \alpha_k z^k, \tag{5.18} $$
$$ x^{k+1} = (1 - \alpha_k)x^k + \alpha_k z^{k+1}. \tag{5.19} $$
Then, the following inequality holds:
$$ q_{k+1}^* \ge f(x^{k+1}) + \frac{1}{2}\left(\frac{c_{k+1}\sigma}{\alpha_k^2} - L\right)\|x^{k+1} - y^k\|^2. $$
Proof. Let $x \in C \cap \mathcal{V}$. Since $q_k(x) = q_k^* + c_kH(x, z^k)$, then by (5.6) and using $c_{k+1} = (1 - \alpha_k)c_k$ one has
$$ q_{k+1}(x) = (1 - \alpha_k)q_k^* + c_{k+1}H(x, z^k) + \alpha_k l_k(x, y^k), $$
and with $z^{k+1} = \operatorname{argmin}_{x \in C \cap \mathcal{V}} q_{k+1}(x)$ one obtains
$$ q_{k+1}(z^{k+1}) = q_{k+1}^* = (1 - \alpha_k)q_k^* + c_{k+1}H(z^{k+1}, z^k) + \alpha_k l_k(z^{k+1}, y^k). \tag{5.20} $$
Under our assumption, we have $q_k^* \ge f(x^k)$, and thus using the gradient inequality for $f$ we have
$$ q_k^* \ge f(x^k) \ge f(y^k) + \langle x^k - y^k, \nabla f(y^k)\rangle, $$
and it follows from (5.20) and (5.7) that
$$ q_{k+1}^* \ge f(y^k) + c_{k+1}H(z^{k+1}, z^k) + \langle \nabla f(y^k), r_k\rangle, \tag{5.21} $$
where $r_k = \alpha_k(z^{k+1} - y^k) + (1 - \alpha_k)(x^k - y^k)$. Noting that $r_k$ can be written as
$$ r_k = (1 - \alpha_k)x^k + \alpha_k z^k - y^k + \alpha_k(z^{k+1} - z^k), $$
and since by definition one has $(1 - \alpha_k)x^k + \alpha_k z^k - y^k = 0$, then (5.21) reduces to
$$ q_{k+1}^* \ge f(y^k) + c_{k+1}H(z^{k+1}, z^k) + \langle \alpha_k(z^{k+1} - z^k), \nabla f(y^k)\rangle. \tag{5.22} $$
Using the definition of $y^k$, $x^{k+1} \in C \cap \mathcal{V}$ given in (5.18)–(5.19), one has $x^{k+1} - y^k = \alpha_k(z^{k+1} - z^k)$. Since by hypothesis (h2) $h$ is $\sigma$-strongly convex, it follows that $H(z^{k+1}, z^k) \ge \frac{\sigma}{2}\|z^{k+1} - z^k\|^2$, and then from (5.22) we obtain
$$ q_{k+1}^* \ge f(y^k) + \frac{1}{2}\frac{c_{k+1}\sigma}{\alpha_k^2}\|x^{k+1} - y^k\|^2 + \langle \nabla f(y^k), x^{k+1} - y^k\rangle. \tag{5.23} $$
Now, since we assumed that $f \in C^{1,1}(C \cap \mathcal{V})$, then by the descent lemma (cf. (4.15)) we have
$$ f(y^k) + \langle x^{k+1} - y^k, \nabla f(y^k)\rangle \ge f(x^{k+1}) - \frac{L}{2}\|x^{k+1} - y^k\|^2. \tag{5.24} $$
Combining the latter inequality with (5.23) we obtain
$$ q_{k+1}^* \ge f(x^{k+1}) + \frac{1}{2}\left(\frac{c_{k+1}\sigma}{\alpha_k^2} - L\right)\|x^{k+1} - y^k\|^2. $$
Therefore by taking a sequence $\{\alpha_k\}$ with $c_{k+1}\sigma \ge L\alpha_k^2$ we can guarantee that $q_{k+1}^* \ge f(x^{k+1})$. In particular, we can choose $L\alpha_k^2 = \sigma c_k(1 - \alpha_k)$, and this leads to the following improved interior gradient algorithm.
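Improved interior gradient algorithm (IGA).
Step 0. Choose a point $x^0 \in C \cap \mathcal{V}$ and a constant $c > 0$. Define $z^0 = x^0 = y^0$, $c_0 = c$, $\lambda = \sigma L^{-1}$.
Step k. For $k \ge 0$, compute the following:
$$ \alpha_k = \frac{\sqrt{(c_k\lambda)^2 + 4c_k\lambda} - c_k\lambda}{2}, $$
$$ y^k = (1 - \alpha_k)x^k + \alpha_k z^k, $$
$$ c_{k+1} = (1 - \alpha_k)c_k, $$
$$ z^{k+1} = \operatorname{argmin}_{x \in C \cap \mathcal{V}}\left\{\left\langle x, \frac{\alpha_k}{c_{k+1}}\nabla f(y^k)\right\rangle + H(x, z^k)\right\} = u\!\left(\frac{\alpha_k}{c_{k+1}}\nabla f(y^k), z^k\right), $$
$$ x^{k+1} = (1 - \alpha_k)x^k + \alpha_k z^{k+1}. $$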
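As an illustration of the IGA steps, here is a minimal Python sketch specialized to the unit simplex, assuming $H$ is the entropy-based Bregman distance (so that $u$ is the multiplicative map of (4.30) and one may take $\sigma = 1$ with respect to $\|\cdot\|_1$). The function names and default parameters are choices made for this example, and `L` is assumed to be a Lipschitz constant of $\nabla f$ for the same norm.

```python
import math


def entropic_u(v, z):
    """u(v, z) for the entropy distance on the unit simplex, as in (4.30)."""
    shift = min(v)  # a common shift of v cancels after normalization
    w = [zj * math.exp(-(vj - shift)) for vj, zj in zip(v, z)]
    total = sum(w)
    return [wj / total for wj in w]


def iga_simplex(grad, x0, L, sigma=1.0, c=1.0, iters=100):
    """Sketch of IGA on the unit simplex with the entropy kernel."""
    x, z, ck = list(x0), list(x0), c
    lam = sigma / L
    for _ in range(iters):
        t = ck * lam
        alpha = (math.sqrt(t * t + 4.0 * t) - t) / 2.0  # root of alpha^2 = t*(1 - alpha)
        y = [(1.0 - alpha) * xj + alpha * zj for xj, zj in zip(x, z)]
        c_next = (1.0 - alpha) * ck
        g = grad(y)
        z_next = entropic_u([alpha / c_next * gj for gj in g], z)
        x = [(1.0 - alpha) * xj + alpha * zj for xj, zj in zip(x, z_next)]
        z, ck = z_next, c_next
    return x
```

Each pass through the loop calls the gradient once and the map $u$ once; the remaining lines are convex combinations.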
Note that the computational work of this algorithm is exactly the same as that of the interior gradient method in section 4 via the computation of $z^{k+1}$, since the remaining steps involve trivial computations. To estimate the rate of convergence we need the following simple lemma on the sequence $\{\alpha_k\}$; see [34] for a proof.
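Lemma 5.3. Let $\lambda_k > 0$, $c_k > 0$ with $c_0 = c$, and let $\{\alpha_k\}$ be the sequence with $\alpha_k \in [0, 1[$ defined by $\alpha_k^2 = \lambda_k c_k(1 - \alpha_k)$ with $c_{k+1} = (1 - \alpha_k)c_k$. Set $\gamma_k := \prod_{l=0}^{k-1}(1 - \alpha_l)$. Then
$$ \gamma_k \le \left(1 + \frac{\sqrt{c}}{2}\sum_{l=0}^{k-1}\sqrt{\lambda_l}\right)^{-2}. $$
In particular, with $\lambda_l = \lambda$ $\forall l$ we have $\gamma_k \le 4(k\sqrt{c\lambda} + 2)^{-2}$.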
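The constant-parameter case of Lemma 5.3 is easy to check numerically. The short sketch below, with illustrative names `c` and `lam` for the two constants, generates the products $\prod_{l<k}(1 - \alpha_l)$ from the recursion so that they can be compared with the bound $4(k\sqrt{c\lambda} + 2)^{-2}$.

```python
import math


def gamma_sequence(c, lam, n):
    """Return [gamma_1, ..., gamma_n] for alpha_k^2 = lam * c_k * (1 - alpha_k),
    c_{k+1} = (1 - alpha_k) * c_k, c_0 = c."""
    ck, gamma, out = c, 1.0, []
    for _ in range(n):
        t = ck * lam
        alpha = (math.sqrt(t * t + 4.0 * t) - t) / 2.0  # unique root in (0, 1)
        gamma *= 1.0 - alpha
        ck *= 1.0 - alpha
        out.append(gamma)
    return out
```

Since $c_k = c\,\gamma_k$ along the recursion, the same loop also shows how quickly the weights $c_k$ used by IGA decay.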
We thus obtain a convergent interior gradient method with an improved convergence rate estimate.
Theorem 5.2. Let $\{x^k\}$, $\{y^k\}$ be the sequences generated by IGA and let $x^*$ be an optimal solution of (P). Then for any $k \ge 0$ we have
$$ f(x^k) - f(x^*) \le \frac{4L}{\sigma k^2 c}\, C(x^*, x^0) = O\!\left(\frac{1}{k^2}\right), $$
where $C(x^*, x^0) = c_0H(x^*, x^0) + f(x^0) - f(x^*)$.
Proof. By Lemma 5.1, the sequence of functions $\{q_k(\cdot)\}$ satisfies (5.1) and thus (5.4) holds; i.e., using (5.5) we have
$$ f(x^k) - f(x^*) \le \gamma_k(q_0(x^*) - f(x^*)) = \gamma_k(f(x^0) + c_0H(x^*, x^0) - f(x^*)) = \gamma_k C(x^*, x^0). $$
Specializing Lemma 5.3 with $\lambda_k = \sigma L^{-1}$, we obtain
$$ \gamma_k \le \frac{4L}{(k\sqrt{\sigma c} + 2\sqrt{L})^2} \le \frac{4L}{\sigma c k^2}, $$
from which the desired result follows.
Thus, to solve (P) to accuracy $\varepsilon > 0$, one needs no more than $\lfloor O(1/\sqrt{\varepsilon})\rfloor$ iterations of IGA, which is a significant reduction (by a square root factor) in comparison to the interior gradient method of section 4. In particular, we note that IGA can be used to solve convex minimization over the unit simplex with this improved global convergence rate estimate for the EMDA of section 4.
REFERENCES
[1] F. Alvarez, J. Bolte, and O. Brahic, Hessian Riemannian gradient flows in convex programming, SIAM J. Control Optim., 43 (2004), pp. 477–501.
[2] H. Attouch and M. Teboulle, A regularized Lotka–Volterra dynamical system as a continuous proximal-like method in optimization, J. Optim. Theory Appl., 121 (2004), pp. 541–570.
[3] A. Auslender and M. Haddou, An interior proximal method for convex linearly constrained problems and its extension to variational inequalities, Math. Program., 71 (1995), pp. 77–100.
[4] A. Auslender, M. Teboulle, and S. Ben-Tiba, A logarithmic-quadratic proximal method for variational inequalities, Comput. Optim. Appl., 12 (1999), pp. 31–40.
[5] A. Auslender, M. Teboulle, and S. Ben-Tiba, Interior proximal and multiplier methods based on second order homogeneous kernels, Math. Oper. Res., 24 (1999), pp. 645–668.
[6] A. Auslender and M. Teboulle, Asymptotic Cones and Functions in Optimization and Variational Inequalities, Springer Monogr. Math., Springer-Verlag, New York, 2003.
[7] A. Auslender and M. Teboulle, Interior gradient and epsilon-subgradient methods for constrained convex minimization, Math. Oper. Res., 29 (2004), pp. 1–26.
[8] A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Oper. Res. Lett., 31 (2003), pp. 167–175.
[9] A. Ben-Tal, T. Margalit, and A. Nemirovski, The ordered subsets mirror descent optimization method with applications to tomography, SIAM J. Optim., 12 (2001), pp. 79–108.
[10] A. Ben-Tal and M. Zibulevsky, Penalty/barrier methods for convex programming problems, SIAM J. Optim., 7 (1997), pp. 347–366.
[11] D. P. Bertsekas, On the Goldstein-Levitin-Polyak gradient projection method, IEEE Trans. Automat. Control, 21 (1976), pp. 174–183.
[12] D. P. Bertsekas, Nonlinear Programming, 2nd ed., Athena Scientific, Belmont, MA, 1999.
[13] J. Bolte and M. Teboulle, Barrier operators and associated gradient-like dynamical systems for constrained minimization problems, SIAM J. Control Optim., 42 (2003), pp. 1266–1292.
[14] H. Brezis, Analyse Fonctionnelle: Théorie et applications, Masson, Paris, 1987.
[15] Y. Censor and S. Zenios, The proximal minimization algorithm with D-functions, J. Optim. Theory Appl., 73 (1992), pp. 451–464.
[16] G. Chen and M. Teboulle, Convergence analysis of a proximal-like minimization algorithm using Bregman functions, SIAM J. Optim., 3 (1993), pp. 538–543.
[17] R. Correa and C. Lemaréchal, Convergence of some algorithm for convex programming, Math. Program., 62 (1993), pp. 261–275.
[18] M. Doljansky and M. Teboulle, An interior proximal algorithm and the exponential multiplier method for semidefinite programming, SIAM J. Optim., 9 (1998), pp. 1–13.
[19] J. Eckstein, Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming, Math. Oper. Res., 18 (1993), pp. 202–226.
[20] J. Eckstein, Approximate iterations in Bregman-function-based proximal algorithms, Math. Program., 83 (1998), pp. 113–123.
[21] P. P. B. Eggermont, Multiplicatively iterative algorithms for convex programming, Linear Algebra Appl., 130 (1990), pp. 25–42.
[22] J. Faraut and A. Korányi, Analysis on Symmetric Cones, Oxford Math. Monogr., The Clarendon Press, Oxford University Press, New York, 1994.
[23] M. Fukushima, Z.-Q. Luo, and P. Tseng, Smoothing functions for second-order-cone complementarity problems, SIAM J. Optim., 12 (2001), pp. 436–460.
[24] O. Güler, On the convergence of the proximal point algorithm for convex minimization, SIAM J. Control Optim., 29 (1991), pp. 403–419.
[25] A. N. Iusem, Interior point multiplicative methods for optimization under positivity constraints, Acta Appl. Math., 38 (1995), pp. 163–184.
[26] A. N. Iusem, B. F. Svaiter, and M. Teboulle, Multiplicative interior gradient methods for minimization over the nonnegative orthant, SIAM J. Control Optim., 34 (1996), pp. 389–406.
[27] A. N. Iusem, B. Svaiter, and M. Teboulle, Entropy-like proximal methods in convex programming, Math. Oper. Res., 19 (1994), pp. 790–814.
[28] A. N. Iusem and M. Teboulle, Convergence rate analysis of nonquadratic proximal and augmented Lagrangian methods for convex and linear programming, Math. Oper. Res., 20 (1995), pp. 657–677.
[29] K. C. Kiwiel, Proximal minimization methods with generalized Bregman functions, SIAM J. Control Optim., 35 (1997), pp. 1142–1168.
[30] B. Lemaire, The proximal algorithm, in New Methods in Optimization and Their Industrial Uses, Internat. Schriftenreihe Numer. Math. 87, J. P. Penot, ed., Birkhäuser, Basel, 1989, pp. 73–87.
[31] B. Martinet, Régularisation d'inéquations variationnelles par approximations successives, Rev. Française Informat. Recherche Opérationnelle, 4 (1970), pp. 154–158.
[32] A. Nemirovski and D. Yudin, Problem Complexity and Method Efficiency in Optimization, John Wiley, New York, 1983.
[33] Y. Nesterov and A. Nemirovskii, Interior Point Polynomial Algorithms in Convex Programming, SIAM, Philadelphia, 1994.
[34] Y. Nesterov, On an approach to the construction of optimal methods of minimization of smooth convex functions, Èkonom. i Mat. Metody, 24 (1988), pp. 509–517.
[35] B. T. Polyak, Introduction to Optimization, Optimization Software, New York, 1987.
[36] R. A. Polyak, Nonlinear rescaling vs. smoothing technique in constrained optimization, Math. Program., 92 (2002), pp. 197–235.
[37] R. A. Polyak and M. Teboulle, Nonlinear rescaling and proximal-like methods in convex optimization, Math. Program., 76 (1997), pp. 265–284.
[38] S. M. Robinson, Linear convergence of epsilon subgradients methods for a class of convex functions, Math. Program., 86 (1999), pp. 41–50.
[39] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[40] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898.
[41] N. Z. Shor, Minimization Methods for Nondifferentiable Functions, Springer-Verlag, Berlin, 1985.
[42] P. J. da Silva e Silva, J. Eckstein, and C. Humes, Jr., Rescaling and stepsize selection in proximal methods using separable generalized distances, SIAM J. Optim., 12 (2001), pp. 238–261.
[43] M. Teboulle, Entropic proximal mappings with applications to nonlinear programming, Math. Oper. Res., 17 (1992), pp. 670–681.
[44] M. Teboulle, Convergence of proximal-like algorithms, SIAM J. Optim., 7 (1997), pp. 1069–1083.
[45] P. Tseng and D. P. Bertsekas, On the convergence of the exponential multiplier method for convex programming, Math. Program., 60 (1993), pp. 1–19.