Lecture 17
Instructor: Quanquan Gu Date: Oct 26th
Today we are going to study the projected gradient descent algorithm.
Consider the following constrained optimization problem:
$$\min_{x \in D} f(x), \tag{1}$$
where $D \subseteq \mathbb{R}^d$ is a closed convex set.
If we apply the gradient descent algorithm directly, we cannot guarantee that in each iteration the new point $x_{t+1} = x_t - \eta_t \nabla f(x_t)$ will lie in $D$. In other words, we may end up with infeasible solutions. To ensure that the new point $x_{t+1}$ obtained in each iteration always lies in $D$, one way is to project the new point back onto the feasible set.
Let us first define the projection of a point onto a set.
Definition 1 (Projection) The projection of a point $x$ onto a set $C$ is defined as $\Pi_C(x) := \arg\min_{y \in C} \frac{1}{2}\|x - y\|_2^2$.
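For many simple feasible sets, this projection has a closed form. As a minimal illustrative sketch (the function names here are placeholders, not part of the notes), here are two standard cases, the Euclidean ball and a box:

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Projection onto the Euclidean ball {y : ||y||_2 <= radius}."""
    norm = np.linalg.norm(x)
    if norm <= radius:
        return x  # already feasible; the projection is x itself
    return (radius / norm) * x  # otherwise, rescale onto the sphere

def project_box(x, lower, upper):
    """Projection onto the box {y : lower <= y <= upper}: componentwise clipping."""
    return np.clip(x, lower, upper)
```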
Theorem 1 (Projection Theorem) Let $C$ be a closed convex set. Then for any point $x$:
(1) $(y - \Pi_C(x))^\top (\Pi_C(x) - x) \ge 0$ for any $y \in C$;
(2) $\|\Pi_C(x) - y\|_2 \le \|x - y\|_2$ for any $y \in C$.

Proof: (1) Let $f(y) = \frac{1}{2}\|x - y\|_2^2$. By the first-order necessary condition of the local minimum $y^* = \Pi_C(x)$, we have $\nabla f(y^*)^\top d \ge 0$, where $d$ is any feasible direction at $y^*$. Let $d = y - \Pi_C(x)$. For any $y \in C$, it then follows that
$$\nabla f(y^*)^\top (y - \Pi_C(x)) \ge 0. \tag{2}$$
Note that $\nabla f(y^*) = -(x - y^*) = y^* - x$ and $y^* = \Pi_C(x)$. From (2), it then follows that
$$(y - \Pi_C(x))^\top (\Pi_C(x) - x) \ge 0.$$
(2) We have
$$\|x - y\|_2^2 = \|x - \Pi_C(x) + \Pi_C(x) - y\|_2^2 = \|x - \Pi_C(x)\|_2^2 + 2(x - \Pi_C(x))^\top (\Pi_C(x) - y) + \|\Pi_C(x) - y\|_2^2 \ge \|\Pi_C(x) - y\|_2^2,$$
where the inequality follows from part (1), since the cross term equals $(y - \Pi_C(x))^\top (\Pi_C(x) - x) \ge 0$. This completes the proof.
Remark 1 Geometrically, the projection theorem says that the angle between the vectors $y - \Pi_C(x)$ and $\Pi_C(x) - x$ is either acute or right.
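As a quick numerical sanity check of part (1), one can sample random points and verify the inequality when $C$ is the unit ball; a sketch reusing the illustrative project_l2_ball above:

```python
rng = np.random.default_rng(0)
for _ in range(1000):
    x = 3.0 * rng.normal(size=5)                  # a point, possibly outside the ball
    y = project_l2_ball(rng.normal(size=5))       # an arbitrary point in C
    p = project_l2_ball(x)                        # Pi_C(x)
    # Part (1): the inner product (y - Pi_C(x))^T (Pi_C(x) - x) is nonnegative
    assert (y - p) @ (p - x) >= -1e-10
```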
Algorithm 1 Projected Gradient Descent
1: Input: ηt
2: Initialize: x1 ∈ D
3: for t = 1 to T − 1 do
4: xt+1 = ΠD [xt − ηt ∇f (xt )]
5: end for
So we modify the updating rule of gradient descent to be $x_{t+1} = \Pi_D[x_t - \eta_t \nabla f(x_t)]$, where $\Pi_D(x)$ is the projection of $x$ onto $D$. Then we have the projected gradient descent algorithm shown in Algorithm 1. It is worth noting that if the gradient of $f$ does not exist at $x_t$, then in the fourth line of Algorithm 1 we can use any subgradient of $f$ at $x_t$ instead of its gradient.
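As a concrete reference, a minimal NumPy sketch of Algorithm 1 might look as follows; the names grad, project, and step_size are placeholders for a user-supplied (sub)gradient oracle, projection operator, and step-size schedule:

```python
import numpy as np

def projected_gradient_descent(grad, project, x1, T, step_size):
    """Run Algorithm 1 for T iterations and return all iterates.

    grad(x):      a (sub)gradient of f at x
    project(x):   the projection Pi_D(x) onto the feasible set D
    x1:           a feasible starting point in D
    step_size(t): the step size eta_t, e.g. lambda t: 1.0 / np.sqrt(t)
    """
    xs = [np.asarray(x1, dtype=float)]
    for t in range(1, T):
        x = xs[-1]
        # Gradient step followed by projection back onto D
        xs.append(project(x - step_size(t) * grad(x)))
    return xs
```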
The following theorem provides the convergence rate for the projected gradient descent
algorithm.
Theorem 2 Suppose that $f$ is a convex function, and its subgradient $g(x)$ is bounded by $G$, i.e., $\|g(x)\|_2 \le G$, for any $x \in D$. Then for the projected gradient descent with $\eta_t = 1/\sqrt{t}$, it holds that
$$f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right) - f(x^*) \le \left(\frac{R^2}{2} + G^2\right)\frac{1}{\sqrt{T}},$$
where $x^*$ is the optimal solution to problem (1) and $R = \max_{x,y \in D} \|x - y\|_2$ is the diameter of the convex set $D$.
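To see the bound in action, one can run the sketch above on a toy instance, say minimizing $f(x) = \frac{1}{2}\|x - c\|_2^2$ over the unit ball with $c$ outside the ball (the setup below is purely illustrative), and evaluate $f$ at the averaged iterate:

```python
c = np.array([2.0, 0.0])
f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad = lambda x: x - c
project = lambda x: project_l2_ball(x, radius=1.0)

T = 10_000
xs = projected_gradient_descent(grad, project, x1=np.zeros(2), T=T,
                                step_size=lambda t: 1.0 / np.sqrt(t))
x_avg = np.mean(xs, axis=0)
x_star = np.array([1.0, 0.0])  # optimum: the boundary point of the ball closest to c
print(f(x_avg) - f(x_star))    # small, consistent with the O(1/sqrt(T)) rate
```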