Stanford University Machine Learning Mathematical Foundations, pages 57-64


We will also assume knowledge of a feasible starting point x0 which satisfies all of our

constraints with strict inequality (as needed for Slater’s condition to hold).8
Recall that in our discussion of the Lagrangian-based formulation of the primal problem,

    min_x  max_{λ : λ_i ≥ 0}  L(x, λ),

we stated that the inner maximization, max_{λ : λ_i ≥ 0} L(x, λ), was constructed in such a way
that the infeasible region of f was "carved away", leaving only points in the feasible region
as candidate minima. The same idea of using penalties to ensure that minimizers stay in the
feasible region is the basis of barrier-based optimization. Specifically, if B(z) is the barrier
function

    B(z) = { 0   if z < 0
           { ∞   if z ≥ 0,

then the primal problem is equivalent to


    min_x  f(x) + Σ_{i=1}^m B(g_i(x)).    (4)

When gi (x) < 0, the objective of the problem is simply f (x); infeasible points are “carved
away” using the barrier function B(z).
While conceptually correct, optimization using the straight barrier function B(z) is
numerically difficult. To ameliorate this, the log-barrier optimization algorithm approximates
the solution to (4) by solving the unconstrained problem

    minimize_x  f(x) − (1/t) Σ_{i=1}^m log(−g_i(x))

for some fixed t > 0. Here, the function −(1/t) log(−z) ≈ B(z), and the accuracy of the
approximation increases as t → ∞. Rather than using a large value of t in order to obtain
a good approximation, however, the log-barrier algorithm works by solving a sequence of
unconstrained optimization problems, increasing t each time, and using the solution of the
previous unconstrained optimization problem as the initial point for the next unconstrained
optimization. Furthermore, at each point in the algorithm, the primal solution points stay
strictly in the interior of the feasible region:

8
For more information on finding feasible starting points for barrier algorithms, see [1], pages 579-585.
For inequality-constrained problems where the primal problem is feasible but not strictly feasible,
primal-dual interior-point methods are applicable; these are also described in [1], pages 609-615.

Log-barrier optimization

• Choose µ > 1, t > 0.

• x ← x0.
• Repeat until convergence:
  (a) Compute x′ = argmin_x f(x) − (1/t) Σ_{i=1}^m log(−g_i(x)) using x as the initial point.
  (b) t ← µ · t, x ← x′.
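The algorithm above can be sketched in code. The following is a minimal illustration on a toy problem of our own choosing (minimize f(x) = x² subject to g(x) = 1 − x ≤ 0, whose optimum is x* = 1); the inner solver, step sizes, and stopping rule are assumptions for the sketch, not part of the notes.

```python
# Log-barrier sketch on a toy problem: minimize x^2 subject to 1 - x <= 0.
# Each outer step minimizes phi_t(x) = x^2 - (1/t) * log(x - 1) by damped
# Newton iteration, then increases t by the factor mu.

def barrier_newton(x, t, iters=100):
    """Minimize phi_t(x) = x^2 - (1/t)*log(x - 1), staying strictly feasible."""
    for _ in range(iters):
        grad = 2.0 * x - (1.0 / t) / (x - 1.0)
        hess = 2.0 + (1.0 / t) / (x - 1.0) ** 2
        step = grad / hess
        while x - step <= 1.0:   # backtrack so the iterate keeps x > 1
            step *= 0.5
        x -= step
        if abs(grad) < 1e-12:
            break
    return x

def log_barrier(x0=2.0, t=1.0, mu=10.0, tol=1e-9, m=1):
    x = x0
    while m / t > tol:            # m/t bounds the suboptimality of the iterate
        x = barrier_newton(x, t)  # warm start from the previous solution
        t *= mu
    return x

x_star = log_barrier()
print(x_star)  # approaches the constrained optimum x* = 1
```

Note how each inner minimization is warm-started from the previous solution, exactly as the algorithm box prescribes, and how the iterates stay strictly inside the feasible region x > 1.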

One might expect that as t increases, the difficulty of solving each unconstrained minimiza-
tion problem also increases due to numerical issues or ill-conditioning of the optimization
problem. Surprisingly, Nesterov and Nemirovski showed in 1994 that this is not the case
for certain types of barrier functions, including the log-barrier; in particular, by using an
appropriate barrier function, one obtains a general convex optimization algorithm which
takes time polynomial in the dimensionality of the optimization variables and the desired
accuracy!

4 Directions for further exploration


In many real-world tasks, 90% of the challenge involves figuring out how to write an opti-
mization problem in a convex form. Once the correct form has been found, a number of
pre-existing software packages for convex optimization have been well-tuned to handle dif-
ferent specific types of optimization problems. The following constitute a small sample of
the available tools:
• commercial packages: CPLEX, MOSEK
• MATLAB-based: CVX, Optimization Toolbox (linprog, quadprog), SeDuMi
• libraries: CVXOPT (Python), GLPK (C), COIN-OR (C)
• SVMs: LIBSVM, SVM-light
• machine learning: Weka (Java)
In particular, we specifically point out CVX as an easy-to-use generic tool for solving convex
optimization problems using MATLAB, and CVXOPT as a powerful Python-based library
which runs independently of MATLAB.9 If you're interested in looking at some of the
other packages listed above, they are easy to find with a web search. In short, if you need a
specific convex optimization algorithm, pre-existing software packages provide a rapid way
to prototype your idea without having to deal with the numerical trickiness of implementing
your own complete convex optimization routines.
9
CVX is available at http://www.stanford.edu/~boyd/cvx and CVXOPT is available at
http://www.ee.ucla.edu/~vandenbe/cvxopt/.

Also, if you find this material fascinating, make sure to check out Stephen Boyd’s class,
EE364: Convex Optimization I, which will be offered during the Winter Quarter. The
textbook for the class (listed as [1] in the References) has a wealth of information about
convex optimization and is available for browsing online.

References
[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Online: http://www.stanford.edu/~boyd/cvxbook/

Appendix: The soft-margin SVM


To see the primal/dual action in practice, we derive the dual of the soft-margin SVM primal
presented in class, and corresponding KKT complementarity conditions. We have,

    minimize_{w,b,ξ}   (1/2)||w||² + C Σ_{i=1}^m ξ_i
    subject to   y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i,   i = 1, ..., m,
                 ξ_i ≥ 0,   i = 1, ..., m.

First, we put this into our standard form, with “≤ 0” inequality constraints and no equality
constraints. That is,

    minimize_{w,b,ξ}   (1/2)||w||² + C Σ_{i=1}^m ξ_i
    subject to   1 − ξ_i − y^(i)(w^T x^(i) + b) ≤ 0,   i = 1, ..., m,
                 −ξ_i ≤ 0,   i = 1, ..., m.

Next, we form the generalized Lagrangian,10

    L(w, b, ξ, α, β) = (1/2)||w||² + C Σ_{i=1}^m ξ_i + Σ_{i=1}^m α_i (1 − ξ_i − y^(i)(w^T x^(i) + b)) − Σ_{i=1}^m β_i ξ_i,

which gives the primal and dual optimization problems:

    max_{α,β : α_i ≥ 0, β_i ≥ 0} θ_D(α, β)   where   θ_D(α, β) := min_{w,b,ξ} L(w, b, ξ, α, β),   (SVM-D)

    min_{w,b,ξ} θ_P(w, b, ξ)   where   θ_P(w, b, ξ) := max_{α,β : α_i ≥ 0, β_i ≥ 0} L(w, b, ξ, α, β).   (SVM-P)

To get the dual problem in the form shown in the lecture notes, however, we still have a
little more work to do. In particular,
10
Here, it is important to note that (w, b, ξ) collectively play the role of the x primal variables. Similarly,
(α, β) collectively play the role of the λ dual variables used for inequality constraints. There are no “ν” dual
variables here since there are no affine constraints in this problem.

1. Eliminating the primal variables. To eliminate the primal variables from the dual
problem, we compute θD (α, β) by noticing that

   θ_D(α, β) = min_{w,b,ξ} L(w, b, ξ, α, β)

   is an unconstrained optimization problem, where the objective function L(w, b, ξ, α, β)
   is differentiable. Therefore, for any fixed (α, β), if (ŵ, b̂, ξ̂) minimize the Lagrangian,
   it must be the case that

       ∇_w L(ŵ, b̂, ξ̂, α, β) = ŵ − Σ_{i=1}^m α_i y^(i) x^(i) = 0    (5)

       ∂/∂b L(ŵ, b̂, ξ̂, α, β) = − Σ_{i=1}^m α_i y^(i) = 0    (6)

       ∂/∂ξ_i L(ŵ, b̂, ξ̂, α, β) = C − α_i − β_i = 0.    (7)

   Adding (6) and (7) to the constraints of our dual optimization problem, we obtain,
       θ_D(α, β) = L(ŵ, b̂, ξ̂, α, β)
         = (1/2)||ŵ||² + C Σ_{i=1}^m ξ̂_i + Σ_{i=1}^m α_i (1 − ξ̂_i − y^(i)(ŵ^T x^(i) + b̂)) − Σ_{i=1}^m β_i ξ̂_i
         = (1/2)||ŵ||² + C Σ_{i=1}^m ξ̂_i + Σ_{i=1}^m α_i (1 − ξ̂_i − y^(i)(ŵ^T x^(i))) − Σ_{i=1}^m β_i ξ̂_i
         = (1/2)||ŵ||² + Σ_{i=1}^m α_i (1 − y^(i)(ŵ^T x^(i))),

   where the first equality follows from the optimality of (ŵ, b̂, ξ̂) for fixed (α, β), the
   second equality uses the definition of the generalized Lagrangian, and the third and
   fourth equalities follow from (6) and (7), respectively. Finally, to use (5), observe that

       (1/2)||ŵ||² + Σ_{i=1}^m α_i (1 − y^(i)(ŵ^T x^(i))) = Σ_{i=1}^m α_i + (1/2)||ŵ||² − ŵ^T Σ_{i=1}^m α_i y^(i) x^(i)
         = Σ_{i=1}^m α_i + (1/2)||ŵ||² − ||ŵ||²
         = Σ_{i=1}^m α_i − (1/2)||ŵ||²
         = Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩.

Therefore, our dual problem (with no more primal variables) is simply
       maximize_{α,β}   Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩
       subject to   α_i ≥ 0,   i = 1, ..., m,
                    β_i ≥ 0,   i = 1, ..., m,
                    α_i + β_i = C,   i = 1, ..., m,
                    Σ_{i=1}^m α_i y^(i) = 0.
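The elimination of the primal variables can be sanity-checked numerically. The following sketch (toy data, constants, and variable names are our own, made up for illustration) builds (α, β, ŵ) satisfying conditions (5)-(7) and verifies that the Lagrangian then no longer depends on b or ξ and equals Σ α_i − (1/2)||ŵ||²:

```python
import random

random.seed(0)
m, d, C = 4, 3, 1.0
x = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
y = [1, 1, -1, -1]

# alpha in [0, C] chosen so that sum_i alpha_i * y_i = 0, satisfying (6).
alpha = [0.3, 0.5, 0.3, 0.5]
beta = [C - a for a in alpha]   # condition (7): C - alpha_i - beta_i = 0
# Condition (5): w_hat = sum_i alpha_i * y_i * x_i.
w = [sum(alpha[i] * y[i] * x[i][k] for i in range(m)) for k in range(d)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lagrangian(b, xi):
    # Generalized Lagrangian L(w, b, xi, alpha, beta) from the text.
    return (0.5 * dot(w, w) + C * sum(xi)
            + sum(alpha[i] * (1 - xi[i] - y[i] * (dot(w, x[i]) + b)) for i in range(m))
            - sum(beta[i] * xi[i] for i in range(m)))

dual_value = sum(alpha) - 0.5 * dot(w, w)
# b and xi are arbitrary: their coefficients vanish by (6) and (7).
assert abs(lagrangian(2.7, [0.1, 0.4, 0.0, 0.9]) - dual_value) < 1e-9
assert abs(lagrangian(-5.0, [0.0, 0.0, 0.0, 0.0]) - dual_value) < 1e-9
print(round(dual_value, 6))
```

That the two assertions pass for arbitrary (b, ξ) is exactly the algebraic point of the derivation: once (5)-(7) hold, the remaining objective depends only on α.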

2. KKT complementarity. KKT complementarity requires that for any primal-optimal
   (w*, b*, ξ*) and dual-optimal (α*, β*),

       α_i* (1 − ξ_i* − y^(i)(w*^T x^(i) + b*)) = 0
       β_i* ξ_i* = 0

   for i = 1, ..., m. From the first condition, we see that if α_i* > 0, then in order for the
   product to be zero, we must have 1 − ξ_i* − y^(i)(w*^T x^(i) + b*) = 0. It follows that

       y^(i)(w*^T x^(i) + b*) ≤ 1,

   since ξ_i* ≥ 0 by primal feasibility. Similarly, if β_i* > 0, then ξ_i* = 0 to ensure comple-
   mentarity. From the primal constraint, y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i, it follows that

       y^(i)(w*^T x^(i) + b*) ≥ 1.

   Finally, since β_i* > 0 is equivalent to α_i* < C (since α_i* + β_i* = C), we can summarize
   the KKT conditions as follows:

       α_i* = 0       ⇒  y^(i)(w*^T x^(i) + b*) ≥ 1,
       0 < α_i* < C   ⇒  y^(i)(w*^T x^(i) + b*) = 1,
       α_i* = C       ⇒  y^(i)(w*^T x^(i) + b*) ≤ 1.
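The three-way case split on α_i* can be made concrete with a tiny helper (the function name and string encoding of the margin condition are our own, not part of the notes):

```python
# Hypothetical helper: map a dual-optimal alpha_i to the margin condition
# on y_i * (w.x_i + b) implied by the KKT summary (C is the penalty parameter).
def margin_case(alpha_i, C):
    if alpha_i == 0.0:
        return ">= 1"  # alpha_i* = 0: point on or outside the margin
    elif alpha_i == C:
        return "<= 1"  # alpha_i* = C: point on or inside the margin
    else:
        return "== 1"  # 0 < alpha_i* < C: point lies exactly on the margin

print(margin_case(0.0, 1.0), margin_case(0.5, 1.0), margin_case(1.0, 1.0))
# prints ">= 1 == 1 <= 1"
```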

3. Simplification. We can tidy up our dual problem slightly by observing that each pair
of constraints of the form
       β_i ≥ 0   and   α_i + β_i = C
is equivalent to the single constraint, αi ≤ C; that is, if we solve the optimization
problem
       maximize_α   Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩
       subject to   0 ≤ α_i ≤ C,   i = 1, ..., m,    (8)
                    Σ_{i=1}^m α_i y^(i) = 0

and subsequently set βi = C − αi , then it follows that (α, β) will be optimal for the
previous dual problem above. This last form, indeed, is the form of the soft-margin
SVM dual given in the lecture notes.

Hidden Markov Models Fundamentals
Daniel Ramage

CS229 Section Notes

December 1, 2007

Abstract
How can we apply machine learning to data that is represented as a
sequence of observations over time? For instance, we might be interested
in discovering the sequence of words that someone spoke based on an
audio recording of their speech. Or we might be interested in annotating
a sequence of words with their part-of-speech tags. These notes provide a
thorough mathematical introduction to the concept of Markov Models (a
formalism for reasoning about states over time) and Hidden Markov
Models (where we wish to recover a series of states from a series of
observations). The final section includes some pointers to resources that
present this material from other perspectives.
1 Markov Models
Given a set of states S = {s_1, s_2, ..., s_|S|} we can observe a series over time
~z ∈ S^T. For example, we might have the states from a weather system S =
{sun, cloud, rain} with |S| = 3 and observe the weather over a few days {z_1 =
s_sun, z_2 = s_cloud, z_3 = s_cloud, z_4 = s_rain, z_5 = s_cloud} with T = 5.
The observed states of our weather example represent the output of a random
process over time. Without further assumptions, state s_j at time t could
be a function of any number of variables, including all the states from times 1
to t−1 and possibly many others that we don't even model. However, we will
make two Markov assumptions that will allow us to tractably reason about
time series.
The limited horizon assumption is that the probability of being in a
state at time t depends only on the state at time t − 1. The intuition underlying
this assumption is that the state at time t represents a sufficient summary of the
past to reasonably predict the future. Formally:

    P(z_t | z_{t−1}, z_{t−2}, ..., z_1) = P(z_t | z_{t−1})

The stationary process assumption is that the conditional distribution


over the next state given the current state does not change over time. Formally:

    P(z_t | z_{t−1}) = P(z_2 | z_1);   t ∈ 2, ..., T

As a convention, we will also assume that there is an initial state and initial
observation z_0 ≡ s_0, where s_0 represents the initial probability distribution over
states at time 0. This notational convenience allows us to encode our belief
about the prior probability of seeing the first real state z_1 as P(z_1 | z_0). Note
that P(z_t | z_{t−1}, ..., z_1) = P(z_t | z_{t−1}, ..., z_1, z_0) because we've defined z_0 = s_0 for
any state sequence. (Other presentations of HMMs sometimes represent these
prior beliefs with a vector π ∈ R^|S|.)
We parametrize these transitions by defining a state transition matrix A ∈
R^((|S|+1)×(|S|+1)). The value A_ij is the probability of transitioning from state i
to state j at any time t. For our sun and rain example, we might have the following
transition matrix:

                 s_0   s_sun  s_cloud  s_rain
       s_0       0     .33    .33      .33
  A =  s_sun     0     .8     .1       .1
       s_cloud   0     .2     .6       .2
       s_rain    0     .1     .2       .7

Note that these numbers (which I made up) represent the intuition that the
weather is self-correlated: if it's sunny it will tend to stay sunny, cloudy will
stay cloudy, etc. This pattern is common in many Markov models and can
be observed as a strong diagonal in the transition matrix. Note that in this
example, our initial state s0 shows uniform probability of transitioning to each
of the three states in our weather system.

1.1 Two questions of a Markov Model

Combining the Markov assumptions with our state transition parametrization


A, we can answer two basic questions about a sequence of states in a Markov
chain. What is the probability of a particular sequence of states ~z? And how
do we estimate the parameters of our model A so as to maximize the likelihood
of an observed sequence ~z?

1.1.1 Probability of a state sequence

We can compute the probability of a particular series of states ~z by use of the
chain rule of probability:

    P(~z) = P(z_t, z_{t−1}, ..., z_1; A)
          = P(z_t, z_{t−1}, ..., z_1, z_0; A)
          = P(z_t | z_{t−1}, z_{t−2}, ..., z_1; A) P(z_{t−1} | z_{t−2}, ..., z_1; A) ⋯ P(z_1 | z_0; A)
          = P(z_t | z_{t−1}; A) P(z_{t−1} | z_{t−2}; A) ⋯ P(z_2 | z_1; A) P(z_1 | z_0; A)
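This telescoping product is easy to compute directly. A minimal sketch using the weather transition matrix from above (the dictionary encoding of A is our own):

```python
# Transition matrix A from the weather example; rows s0/sun/cloud/rain.
A = {
    "s0":    {"sun": 0.33, "cloud": 0.33, "rain": 0.33},
    "sun":   {"sun": 0.8,  "cloud": 0.1,  "rain": 0.1},
    "cloud": {"sun": 0.2,  "cloud": 0.6,  "rain": 0.2},
    "rain":  {"sun": 0.1,  "cloud": 0.2,  "rain": 0.7},
}

def sequence_probability(z, A):
    """P(z) = P(z1|z0) * P(z2|z1) * ... * P(zt|z_{t-1}), with z0 = s0."""
    prob, prev = 1.0, "s0"
    for state in z:
        prob *= A[prev][state]   # one factor per transition in the chain rule
        prev = state
    return prob

# The five-day sequence from Section 1: sun, cloud, cloud, rain, cloud.
p = sequence_probability(["sun", "cloud", "cloud", "rain", "cloud"], A)
print(p)  # .33 * .1 * .6 * .2 * .2
```

The initial factor P(z_1 | z_0) comes from the s_0 row of A, which is exactly the role the convention z_0 ≡ s_0 plays in the derivation.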
