Karush-Kuhn-Tucker (KKT) Conditions: Lecture 11: Convex Optimization
Consider the optimization problem
$$\min_{x \in \mathbb{R}^n} f(x)$$
subject to:
$$g_i(x) = 0, \quad i = 1, \ldots, m$$
$$h_j(x) \le 0, \quad j = 1, \ldots, p. \qquad (1)$$
The $g_i$'s are equality constraints and the $h_j$'s are inequality constraints, and usually they are assumed to be of class $C^2$. A point that satisfies all constraints is said to be a feasible point. An inequality constraint is said to be active at a feasible point $x$ if $h_j(x) = 0$ and inactive if $h_j(x) < 0$. Equality constraints are always active at any feasible point. To simplify notation we write $h = [h_1, \ldots, h_p]$ and $g = [g_1, \ldots, g_m]$, and the constraints now become $g(x) = 0$ and $h(x) \le 0$.
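For example, with the single constraint $h(x) = 1 - x \le 0$ on $\mathbb{R}$, the constraint is active at the feasible point $x = 1$ (where $h(1) = 0$) and inactive at $x = 2$ (where $h(2) = -1 < 0$).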
The KKT conditions state that if $x^*$ is a local minimum of (1), then (under suitable regularity conditions) there exist multipliers $\lambda \in \mathbb{R}^m$ and $\mu \in \mathbb{R}^p$ with $\mu \ge 0$ such that
$$\nabla f(x^*) + \sum_{i=1}^m \lambda_i \nabla g_i(x^*) + \sum_{j=1}^p \mu_j \nabla h_j(x^*) = 0 \qquad (2)$$
$$g_i(x^*) = 0 \quad (i = 1, \ldots, m) \qquad (3)$$
$$\mu_j h_j(x^*) = 0 \quad (j = 1, \ldots, p) \qquad (4)$$
Convince yourself why the above conditions hold geometrically. It is convenient to introduce the Lagrangian associated with the problem as
$$L(x, \lambda, \mu) = f(x) + \mu^T h(x) + \lambda^T g(x)$$
where $\lambda \in \mathbb{R}^m$, $\mu \in \mathbb{R}^p$ and $\mu \ge 0$ are Lagrange multipliers. Note that equations (2), (3) and (4) together give a total of $n + m + p$ equations in the $n + m + p$ variables $x^*$, $\lambda^*$ and $\mu^*$.
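As a quick worked check of conditions (2)-(4), consider the toy problem $\min_{x \in \mathbb{R}^2} x_1^2 + x_2^2$ subject to $h(x) = 1 - x_1 - x_2 \le 0$. Stationarity (2) gives $2x_1 - \mu = 0$ and $2x_2 - \mu = 0$, so $x_1 = x_2 = \mu/2$. Complementary slackness (4) gives $\mu (1 - x_1 - x_2) = 0$; taking $\mu = 0$ forces $x = (0,0)$, which is infeasible, so the constraint must be active, giving $x_1 + x_2 = 1$ and hence $x^* = (1/2, 1/2)$ with $\mu^* = 1 \ge 0$.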
From now on we assume that we only have inequality constraints, for simplicity. The case with equality constraints can be handled in a similar way, except that $\lambda$ does not have the nonnegativity constraint that $\mu$ has. So in our case we have the following optimization problem:
$$\min_x f(x) \quad \text{s.t.} \quad h(x) \le 0.$$
For any $x$ and any $\mu \ge 0$ we have $\inf_x L(x, \mu) \le L(x, \mu) \le \sup_{\mu \ge 0} L(x, \mu)$, and thus
$$\sup_{\mu \ge 0} \inf_x L(x, \mu) \le \inf_x \sup_{\mu \ge 0} L(x, \mu).$$
Now suppose there exists a pair $(x^*, \mu^*)$ with $\mu^* \ge 0$ such that $L(x^*, \mu) \le L(x^*, \mu^*) \le L(x, \mu^*)$ for all $x$ and all $\mu \ge 0$. Then
$$\inf_x \sup_{\mu \ge 0} L(x, \mu) \le \sup_{\mu \ge 0} L(x^*, \mu) = L(x^*, \mu^*) = \inf_x L(x, \mu^*) \le \sup_{\mu \ge 0} \inf_x L(x, \mu).$$
Thus we have
$$\inf_x \sup_{\mu \ge 0} L(x, \mu) = \sup_{\mu \ge 0} \inf_x L(x, \mu).$$
The point $(x^*, \mu^*)$ is called a saddle point. One example is the function $L(x, \mu) = x^2 - \mu^2$, with saddle point $(0, 0)$ as shown in Figure 1.
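To verify the saddle point property in this example, note that for every $\mu$ and every $x$ we have $L(0, \mu) = -\mu^2 \le 0 = L(0, 0) \le x^2 = L(x, 0)$, which is exactly $L(x^*, \mu) \le L(x^*, \mu^*) \le L(x, \mu^*)$ with $(x^*, \mu^*) = (0, 0)$.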
Weak duality always holds, and strong duality holds if $f$ and the $h_j$'s are convex and there exists at least one feasible point which is an interior point. The Lagrange dual function $D(\mu)$ is defined as
$$D(\mu) := \inf_x L(x, \mu) = \inf_x \Big\{ f(x) + \sum_{j=1}^p \mu_j h_j(x) \Big\}.$$
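For instance, for the one-dimensional problem $\min_x x^2$ subject to $1 - x \le 0$ (optimal value $p^* = 1$ at $x = 1$), we have $L(x, \mu) = x^2 + \mu(1 - x)$; minimizing over $x$ gives $x = \mu/2$ and $D(\mu) = \mu - \mu^2/4$. Maximizing $D(\mu)$ over $\mu \ge 0$ gives $\mu^* = 2$ and $D(\mu^*) = 1 = p^*$, so strong duality holds here, as expected since the problem is convex and has a strictly feasible point.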
As an application, consider the soft-margin SVM optimization problem
$$\min_{w, b, \xi} \ \frac{1}{n} \sum_{i=1}^n \xi_i + \|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i; \ \xi_i \ge 0 \ \forall i.$$
Figure 1: Left: Saddle point $(0, 0)$ of $L(x, \mu) = x^2 - \mu^2$; Right: Geometric interpretation of duality.
Now the Lagrangian can be written as
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{n} \sum_{i=1}^n \xi_i + w^T w + \sum_{i=1}^n \alpha_i (1 - \xi_i - y_i w^T x_i - y_i b) - \sum_{i=1}^n \beta_i \xi_i$$
where the Lagrange multipliers $\alpha \ge 0$ and $\beta \ge 0$. We want to remove the primal variables $w$, $b$, $\xi$ by minimization, i.e. set the following derivatives to zero:
$$\frac{\partial L}{\partial w} = 0 \ \Longrightarrow \ w = \frac{1}{2} \sum_{i=1}^n \alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b} = 0 \ \Longrightarrow \ \sum_{i=1}^n \alpha_i y_i = 0$$
$$\frac{\partial L}{\partial \xi_i} = 0 \ \Longrightarrow \ \alpha_i + \beta_i = \frac{1}{n}.$$
Substituting these back into the Lagrangian eliminates $w$, $b$ and $\xi$ and gives
$$\sum_{i=1}^n \alpha_i - \frac{1}{4} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j.$$
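In detail, the elimination works as follows: since $\alpha_i + \beta_i = 1/n$, the slack terms cancel, $\frac{1}{n}\xi_i - \alpha_i \xi_i - \beta_i \xi_i = 0$; since $\sum_i \alpha_i y_i = 0$, the term $-b \sum_i \alpha_i y_i$ vanishes; and since $w = \frac{1}{2}\sum_i \alpha_i y_i x_i$, we have $\sum_i \alpha_i y_i w^T x_i = w^T \big(\sum_i \alpha_i y_i x_i\big) = 2 w^T w$, so $w^T w - \sum_i \alpha_i y_i w^T x_i = -w^T w = -\frac{1}{4}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$.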
Since we have $\alpha_i \ge 0$, $\beta_i \ge 0$ and $\alpha_i + \beta_i = 1/n$, we thus have $0 \le \alpha_i \le 1/n$. So the dual optimization problem becomes
$$\max_\alpha \ \sum_{i=1}^n \alpha_i - \frac{1}{4} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_{i=1}^n \alpha_i y_i = 0, \ 0 \le \alpha_i \le \frac{1}{n},$$
which is a quadratic programming problem. Note that due to the constraints, the dual solution is in general sparse, i.e. we have many $\alpha_i$'s equal to 0. We have the following observations:
1. If $\alpha_i > 0$: we have $y_i(w^T x_i + b) = 1 - \xi_i \le 1$. So the example is either at or on the wrong side of the margin. Such examples with $\alpha_i > 0$ are called support vectors.
2. If $\alpha_i = 0$: we have $\beta_i = 1/n$ and thus, by complementary slackness, $\xi_i = 0$. So $y_i(w^T x_i + b) \ge 1$. Such examples are on the correct side of the margin.
3. If $y_i(w^T x_i + b) < 1$: we have $\xi_i > 0$ and thus $\beta_i = 0$ and $\alpha_i = 1/n$. So if an example causes a margin error then its dual variable $\alpha_i$ sits at the upper boundary $1/n$.
4. It is possible that for examples which are on the correct side of the margin, their $\alpha_i$'s are nonzero.
5. In the objective the $x_i$'s always appear in the form of the inner product $x_i^T x_j$. So if we first map $x_i$ into a feature vector $\Phi(x_i)$, then we could replace $x_i^T x_j$ by $\langle \Phi(x_i), \Phi(x_j) \rangle$. This leads to the introduction of reproducing kernel Hilbert spaces in SVMs; a small numerical sketch of the kernelized dual follows below.
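The following is a minimal sketch of solving the dual problem above numerically, with a kernel $k(x_i, x_j)$ in place of $x_i^T x_j$. It uses a generic solver rather than a dedicated SVM routine; the toy data, the RBF kernel choice, and the helper name solve_svm_dual are illustrative assumptions, not anything specified in these notes.

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(X, y, kernel):
    """Solve max_a sum_i a_i - (1/4) sum_ij a_i a_j y_i y_j k(x_i, x_j)
    s.t. sum_i a_i y_i = 0 and 0 <= a_i <= 1/n, with a generic solver."""
    n = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix replacing x_i^T x_j
    Q = (y[:, None] * y[None, :]) * K                          # Q_ij = y_i y_j k(x_i, x_j)
    neg_dual = lambda a: -(a.sum() - 0.25 * a @ Q @ a)         # negate: minimize instead of maximize
    cons = [{"type": "eq", "fun": lambda a: a @ y}]            # sum_i a_i y_i = 0
    bounds = [(0.0, 1.0 / n)] * n                              # box constraints 0 <= a_i <= 1/n
    res = minimize(neg_dual, np.full(n, 0.5 / n),
                   bounds=bounds, constraints=cons, method="SLSQP")
    return res.x

# Toy two-class data, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (10, 2)), rng.normal(1, 0.5, (10, 2))])
y = np.array([-1.0] * 10 + [1.0] * 10)
rbf = lambda u, v: np.exp(-np.linalg.norm(u - v) ** 2)         # k(u, v) = <Phi(u), Phi(v)> for an RBF feature map
alpha = solve_svm_dual(X, y, rbf)
print("nonzero alphas (support vectors):", np.sum(alpha > 1e-6))
```

With the linear kernel $k(u, v) = u^T v$ this recovers exactly the dual QP derived above; the nonzero $\alpha_i$'s identify the support vectors.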