
SYS 6003: Optimization Fall 2016

Lecture 17
Instructor: Quanquan Gu Date: Oct 26th
Today we are going to study the projected gradient descent algorithm.
Consider the following constrained optimization problem:

\[
\min_{x \in D} f(x). \tag{1}
\]

If we apply the gradient descent algorithm directly, we cannot guarantee that the iterate $x_{t+1} = x_t - \eta_t \nabla f(x_t)$ lies in $D$ at each iteration. In other words, we may end up with infeasible solutions. To ensure that the new point $x_{t+1}$ obtained in each iteration is always in $D$, one way is to project the new point back onto the feasible set.
Let us first define the projection of a point onto a set.
Definition 1 (Projection) The projection of a point $x$ onto a set $C$ is defined as $\Pi_C(x) := \arg\min_{y \in C} \frac{1}{2}\|x - y\|_2^2$.
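
For many common feasible sets this projection has a simple closed form. The following Python sketch (our own illustration, not part of the original notes; the function names are hypothetical) shows the projection onto a Euclidean ball and onto a box, both of which follow directly from Definition 1.

```python
import numpy as np

def project_ball(x, center, radius):
    """Project x onto the Euclidean ball {y : ||y - center||_2 <= radius}."""
    d = x - center
    norm = np.linalg.norm(d)
    if norm <= radius:
        return x.copy()                    # x is already feasible
    return center + radius * d / norm      # rescale onto the boundary

def project_box(x, lower, upper):
    """Project x onto the box {y : lower <= y <= upper} (componentwise clipping)."""
    return np.clip(x, lower, upper)

# Example: projecting (3, 4) onto the unit ball centered at the origin gives (0.6, 0.8).
print(project_ball(np.array([3.0, 4.0]), np.zeros(2), 1.0))
```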

Theorem 1 (Projection Theorem) Let $C \subseteq \mathbb{R}^d$ be a convex set. For any $x \in \mathbb{R}^d$ and $y \in C$, it holds that

(1) $(\Pi_C(x) - y)^\top (\Pi_C(x) - x) \le 0$;

(2) $\|\Pi_C(x) - y\|_2^2 + \|\Pi_C(x) - x\|_2^2 \le \|x - y\|_2^2$.

Proof: (1) Let $f(y) = \frac{1}{2}\|x - y\|_2^2$. By the first-order necessary condition for the local minimum $y^* = \Pi_C(x)$, we have $\nabla f(y^*)^\top d \ge 0$ for any feasible direction $d$ at $y^*$. Let $d = y - \Pi_C(x)$. For any $y \in C$, it then follows that

\[
\nabla f(y^*)^\top (y - \Pi_C(x)) \ge 0. \tag{2}
\]

Note that $\nabla f(y^*) = -(x - y^*) = y^* - x$ and $y^* = \Pi_C(x)$. From (2), it then follows that

\[
(\Pi_C(x) - x)^\top (y - \Pi_C(x)) \ge 0, \quad \text{i.e.,} \quad (\Pi_C(x) - x)^\top (\Pi_C(x) - y) \le 0.
\]

(2) We have

\[
\begin{aligned}
\|x - y\|_2^2 &= \|x - \Pi_C(x) + \Pi_C(x) - y\|_2^2 \\
&= \|x - \Pi_C(x)\|_2^2 + \|\Pi_C(x) - y\|_2^2 - 2(\Pi_C(x) - y)^\top(\Pi_C(x) - x) \\
&\ge \|x - \Pi_C(x)\|_2^2 + \|\Pi_C(x) - y\|_2^2,
\end{aligned}
\]

where the inequality follows from part (1). This completes the proof.

Remark 1 Geometrically, the projection theorem says that the angle between the vectors $y - \Pi_C(x)$ and $\Pi_C(x) - x$ is either acute or right.
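
As a quick numerical sanity check (our own addition, not in the original notes), both parts of Theorem 1 can be verified for the unit ball, whose projection is $\Pi_C(x) = x/\max\{1, \|x\|_2\}$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 5.0 * rng.normal(size=3)                   # an arbitrary point in R^3
y = rng.normal(size=3)
y /= max(1.0, np.linalg.norm(y))               # a point inside the unit ball C
p = x / max(1.0, np.linalg.norm(x))            # Pi_C(x) for the unit ball

# Part (1): (Pi_C(x) - y)^T (Pi_C(x) - x) <= 0
print((p - y) @ (p - x) <= 1e-12)
# Part (2): ||Pi_C(x) - y||_2^2 + ||Pi_C(x) - x||_2^2 <= ||x - y||_2^2
print(np.linalg.norm(p - y) ** 2 + np.linalg.norm(p - x) ** 2
      <= np.linalg.norm(x - y) ** 2 + 1e-12)
```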

Algorithm 1 Projected Gradient Descent
1: Input: $\eta_t$
2: Initialize: $x_1 \in D$
3: for $t = 1$ to $T - 1$ do
4: $x_{t+1} = \Pi_D[x_t - \eta_t \nabla f(x_t)]$
5: end for

So we modify the updating rule of gradient descent to be $x_{t+1} = \Pi_D[x_t - \eta_t \nabla f(x_t)]$, where $\Pi_D(x)$ is the projection of $x$ onto $D$. Then we have the projected gradient descent algorithm shown in Algorithm 1. It is worth noting that if the gradient of $f$ does not exist at $x_t$, then in the fourth line of Algorithm 1, we can use any subgradient of $f$ at $x_t$ instead of its gradient.
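
To make Algorithm 1 concrete, here is a minimal Python sketch (our own illustration; the toy problem and all names are our choices, not from the course) that runs projected gradient descent on a convex quadratic over the unit Euclidean ball, using the step size $\eta_t = 1/\sqrt{t}$ that appears in Theorem 2 below.

```python
import numpy as np

def projected_gradient_descent(grad, project, x1, T, step):
    """Algorithm 1: x_{t+1} = Pi_D[x_t - eta_t * grad f(x_t)] for t = 1, ..., T-1."""
    x = x1.copy()
    iterates = [x.copy()]
    for t in range(1, T):
        x = project(x - step(t) * grad(x))
        iterates.append(x.copy())
    return iterates

# Toy problem: minimize f(x) = ||x - c||_2^2 over D = {x : ||x||_2 <= 1}, with c outside D.
c = np.array([2.0, 1.0])
grad = lambda x: 2.0 * (x - c)                         # gradient of f
project = lambda x: x / max(1.0, np.linalg.norm(x))    # projection onto the unit ball
step = lambda t: 1.0 / np.sqrt(t)                      # eta_t = 1/sqrt(t)

iterates = projected_gradient_descent(grad, project, np.zeros(2), T=200, step=step)
x_bar = np.mean(iterates, axis=0)        # averaged iterate, used in Theorem 2 below
print(iterates[-1], x_bar)               # both approach c/||c||_2, the constrained minimizer
```

The projection step keeps every iterate feasible, which is exactly what unconstrained gradient descent cannot guarantee on this problem.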
The following theorem provides the convergence rate for the projected gradient descent
algorithm.

Theorem 2 Suppose that $f$ is a convex function, and its subgradient $g(x)$ is bounded by $G$, i.e., $\|g(x)\|_2 \le G$, for any $x \in D$. Then for the projected gradient descent with $\eta_t = 1/\sqrt{t}$, it holds that

\[
f\!\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right) - f(x^*) \le \left(\frac{R^2}{2} + G^2\right)\frac{1}{\sqrt{T}},
\]

where $x^*$ is the optimal solution to problem (1) and $R = \max_{x,y \in D} \|x - y\|_2$ is the diameter of the convex set $D$.
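
A standard way to read this bound (our own remark, not stated explicitly in the notes) is as an iteration-complexity estimate: to guarantee a target accuracy $\epsilon > 0$ for the averaged iterate, it suffices that

\[
\left(\frac{R^2}{2} + G^2\right)\frac{1}{\sqrt{T}} \le \epsilon
\quad\Longleftrightarrow\quad
T \ge \left(\frac{R^2/2 + G^2}{\epsilon}\right)^{2},
\]

so projected (sub)gradient descent needs on the order of $1/\epsilon^2$ iterations to reach an $\epsilon$-accurate solution.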
